Somali morphological analysis progress report

Over the past several months, I’ve been working on a Somali morphological analyzer. It is rule-based and built with HFST, so extending it takes some work, but it runs quickly and smoothly. Below are the analyses it returns for several forms of the word baabuur ‘truck’.

baabuur
baabuur baabuur+N+Masc+Indef+Sg+Nom
baabuur baabuur+N+Masc+Indef+Sg+Abs
baabuur baabuur+N+Masc+Indef+Sg+Gen

baabuurro
baabuurro       baabuur+N+Fem+Pl+Indef*

baabuurradii
baabuurradii    baabuur+N+Fem+Def+Pl+Nom+Dist
baabuurradii    baabuur+N+Fem+Def+Pl+Abs+Dist

*Note: Some Somali nouns show gender polarity: they switch between Fem. and Masc. from Sg. to Pl., so a noun that is masculine in the singular is feminine in the plural, and vice versa.
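For downstream work it helps to split each output line into its parts. The following is a minimal sketch, assuming the two-column "surface form, then analysis" layout shown above; the names `Analysis` and `parse_analysis` are hypothetical and are not part of HFST or of the analyzer itself.

```python
from typing import NamedTuple, Tuple


class Analysis(NamedTuple):
    """One analyzer reading: surface form, lemma, and morphological tags."""
    surface: str
    lemma: str
    tags: Tuple[str, ...]


def parse_analysis(line: str) -> Analysis:
    """Split a 'surface<whitespace>lemma+Tag+Tag...' line into its parts.

    Assumes the layout of the sample output above; a trailing '*'
    (the footnote marker used in this post) is dropped.
    """
    surface, analysis = line.split(None, 1)
    lemma, *tags = analysis.strip().rstrip("*").split("+")
    return Analysis(surface, lemma, tuple(tags))


# Lines copied from the sample output above.
lines = [
    "baabuur baabuur+N+Masc+Indef+Sg+Nom",
    "baabuurro       baabuur+N+Fem+Pl+Indef*",
    "baabuurradii    baabuur+N+Fem+Def+Pl+Abs+Dist",
]
analyses = [parse_analysis(line) for line in lines]

for a in analyses:
    print(a.surface, a.lemma, "+".join(a.tags))
```

With the readings in this shape, it is easy to group them by surface form, which is roughly what a later disambiguation step would need as input.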

The analyzer can of course be run in reverse to generate word forms: give it an analysis and it returns the surface form. This is one of the first stages of rule-based machine translation, text tagging, and similar applications for Somali. Once the analyzer is far enough along and I’ve worked out the kinks, I’ll probably work on disambiguating multiple analyses.

Along the way, I’ve been compiling a corpus of news articles with which to test the coverage of my analyzer and help extend it. One of the ways I’m working to add words to the analyzer is by automatically guessing which inflectional type a word belongs to. I’m happy to say this is on the way too, but not quite there yet. Either way, the plan is to dump a list of words into the (Python) program, and extend the analyzer with those that pass with flying colors. Of course, there aren’t many word categories in the program yet, but I’m fairly confident that I can get decent results.
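As a rough illustration of what testing coverage against a corpus can look like, here is a minimal sketch. It assumes the analyzer is wrapped in a function that returns a list of readings for a token (empty when the token is unknown); the tokenizer, the `coverage` function, and the toy lexicon are all hypothetical stand-ins, not the actual setup.

```python
import re
from collections import Counter

# Very rough tokenizer: runs of letters, allowing word-internal apostrophes.
TOKEN = re.compile(r"[a-z]+(?:'[a-z]+)*")


def coverage(text, analyze):
    """Return (share of tokens with at least one reading, Counter of unknowns).

    `analyze` is any callable mapping a token to a list of readings;
    here it stands in for a call to the real analyzer.
    """
    tokens = TOKEN.findall(text.lower())
    unknown = Counter(t for t in tokens if not analyze(t))
    covered = len(tokens) - sum(unknown.values())
    ratio = covered / len(tokens) if tokens else 0.0
    return ratio, unknown


# Toy stand-in for the analyzer: a lexicon with two known forms.
toy_lexicon = {
    "baabuur": ["baabuur+N+Masc+Indef+Sg+Nom"],
    "baabuurro": ["baabuur+N+Fem+Pl+Indef"],
}

ratio, unknown = coverage(
    "Baabuur iyo baabuurro, baabuur kale.",
    lambda token: toy_lexicon.get(token, []),
)
print(f"coverage: {ratio:.0%}")
print("most frequent unknown tokens:", unknown.most_common(5))
```

Sorting the unknown tokens by frequency gives a natural to-do list: the most common missing words are the ones worth adding to the analyzer first.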

Following is an example. Each word’s forms go through a list of simple tests, and each word category is scored by how many of the forms fit a phonetic pattern belonging to that category. The best-scoring category is marked with an arrow.

aalad, aalado, aaladda, aaladdu, aaladdii, aaladaha, aaladuhu, aaladihii
  D1F: 8/8  <--
  D1M: 5/8 
  D2M: 4/8 
  D2F: 4/8 
--
geed, geedka, geedku, geedkii, geedo, geedaha, geedihii, geeduhu
  D1F: 5/8 
  D1M: 8/8  <--
  D2M: 7/8 
  D2F: 1/8 
--
baabuur, baabuurka, baabuurku, baabuurro, baabuurrada, baabuurradii
  D1F: 2/6 
  D1M: 4/6 
  D2M: 6/6  <--
  D2F: 4/6 
--
magac, magaca, magucu, magicii, magacyo, magacyada, magacyadii
  D1F: 2/7 
  D1M: 5/7 
  D2M: 7/7  <--
  D2F: 4/7 
--
subax, subaxda, subaxdu, subaxdii, subaxyo, subaxyada, subaxyadii
  D1F: 5/7 
  D1M: 2/7 
  D2M: 4/7 
  D2F: 7/7  <--
--
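The scoring idea can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: it swaps the phonetic tests for plain suffix tables, treats the first form in the list as the stem, and covers just two categories. The tables and the names `SUFFIXES`, `score_forms`, and `guess_class` are hypothetical and are not taken from the actual program.

```python
# Hypothetical suffix tables, one per inflectional category (assumption:
# the real tests are phonetic patterns, not fixed suffix lists).
SUFFIXES = {
    "D1F": {"", "o", "da", "du", "dii", "aha", "uhu", "ihii"},
    "D1M": {"", "o", "ka", "ku", "kii", "aha", "uhu", "ihii"},
}


def score_forms(forms, classes=SUFFIXES):
    """Count, per category, how many forms are stem + an allowed suffix."""
    stem = forms[0]  # assumption: the first form listed is the bare stem
    return {
        name: sum(
            1 for f in forms
            if f.startswith(stem) and f[len(stem):] in suffixes
        )
        for name, suffixes in classes.items()
    }


def guess_class(forms, classes=SUFFIXES):
    """Return the best category only if every form fits it, else None."""
    scores = score_forms(forms, classes)
    best = max(scores, key=scores.get)
    return best if scores[best] == len(forms) else None


aalad = ["aalad", "aalado", "aaladda", "aaladdu", "aaladdii",
         "aaladaha", "aaladuhu", "aaladihii"]
geed = ["geed", "geedka", "geedku", "geedkii", "geedo",
        "geedaha", "geedihii", "geeduhu"]
baabuur = ["baabuur", "baabuurka", "baabuurku", "baabuurro",
           "baabuurrada", "baabuurradii"]

for forms in (aalad, geed, baabuur):
    print(", ".join(forms))
    for name, hits in score_forms(forms).items():
        print(f"  {name}: {hits}/{len(forms)}")
    print("  guess:", guess_class(forms))
```

Requiring a perfect score before accepting a guess mirrors the "pass with flying colors" rule: a word like baabuur, which fits neither toy table completely, is left for manual review instead of being added under the wrong category.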

And of course, I’m planning to make the source for these programs available once I’ve cleaned it up, removed my notes, and written more useful documentation.
