Translation to Qglic with Finite State Transducers

Qglic (pronounced Anglish) is a near-phonemic alternative writing system for English. Being near-phonemic, it aims for as close to a one-to-one correspondence as possible between the sounds of English and the letters used to represent them. One of the benefits of Qglic is that it attempts to do this using only the letters A through Z. A small sample follows: this same paragraph, written in Qglic.

Qglic iz ey funymik qltrnutiv ruyti’g sistum for I’glic. Byi’g funymik (or nirly so), xu gol iz tu hav ez klos tw ey wun-tu-wun koruspqnduns bitwyn saondz in I’glic and xu letrz ywzd tu reprizent xu saondz. Wun uv xu benufits ti Qglic iz xat it utemps ti dw xis ywzi’g only xu letrz A xrw Z. Yw kan sy u smol sampul uv it fqloi’g, witc iz xis perugraf but dcist ritun in Qglic.

I discovered Qglic a year or so ago, and recently remembered it and got excited about it all over again. Using my newly acquired skills in language technology, I spent some time putting together a simple finite-state machine based on the phonemic rules of Qglic and the CMU Pronouncing Dictionary, which is vast (approximately 133,000 words). The dictionary’s pronunciation guides are written in Arpabet, which makes them fairly easy to translate into IPA or, in this case, Qglic.

ABSCOND  AE0 B S K AA1 N D
ABSCONDED  AE0 B S K AA1 N D AH0 D
ABSCONDING  AE0 B S K AA1 N D IH0 NG 
ABSCONDS  AE0 B S K AA1 N D Z
ABSECON  AE1 B S AH0 K AO0 N
ABSENCE  AE1 B S AH0 N S
ABSENCES  AE1 B S AH0 N S IH0 Z
ABSENT  AE1 B S AH0 N T
ABSENTEE  AE2 B S AH0 N T IY1
ABSENTEEISM  AE2 B S AH0 N T IY1 IH0 Z AH0 M
ABSENTEES  AE2 B S AH0 N T IY1 Z

Taking this data, I wrote a short Python script (I’ll upload it somewhere soon) that translates the pronunciation guides into Qglic and then writes them out in a format compatible with the Helsinki Finite State Transducer (HFST) tools:

abscond:abskqnd         ennd ;
absconded:abskqndud             ennd ;
absconding:abskqndi'g               ennd ;
absconds:abskqndz               ennd ;
absecon:absukon         ennd ;
absence:absuns          ennd ;
absences:absunsiz               ennd ;
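
The conversion step can be sketched roughly as follows. This is not the actual script, only a minimal illustration: the mapping table covers just the phonemes visible in the excerpts above and was inferred from those examples, the function names are made up, and stress digits are simply dropped, which matches the sample entries but may not match the full Qglic rules.

```python
# Partial Arpabet -> Qglic table, inferred ONLY from the sample entries
# above. The full table is larger; treat every entry as an assumption.
ARPABET_TO_QGLIC = {
    "AE": "a", "AA": "q", "AH": "u", "AO": "o", "IH": "i",
    "B": "b", "D": "d", "K": "k", "N": "n", "NG": "'g",
    "S": "s", "Z": "z",
}


def arpabet_to_qglic(phonemes):
    """Map a list of Arpabet symbols (e.g. ['AE0', 'B']) to a Qglic string."""
    letters = []
    for symbol in phonemes:
        base = symbol.rstrip("012")  # drop the stress digit, if any
        if base not in ARPABET_TO_QGLIC:
            raise KeyError(f"no Qglic mapping in this sketch for {symbol!r}")
        letters.append(ARPABET_TO_QGLIC[base])
    return "".join(letters)


def to_lexicon_entry(cmu_line, continuation="ennd"):
    """Turn one CMU dictionary line into an 'english:qglic  ennd ;' entry."""
    word, *phonemes = cmu_line.split()
    return f"{word.lower()}:{arpabet_to_qglic(phonemes)}\t{continuation} ;"


if __name__ == "__main__":
    for line in ["ABSCOND  AE0 B S K AA1 N D",
                 "ABSCONDING  AE0 B S K AA1 N D IH0 NG"]:
        print(to_lexicon_entry(line))
```

Raising an error on an unmapped phoneme, rather than silently skipping it, makes gaps in the table obvious while it is still being filled in.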

It’s a very simple finite-state machine in terms of the effort that went into producing it. It consists of just a huge list of entries in the format english:qglic, where each entry is a path through the machine that reads the English spelling and writes the Qglic spelling. The result is very fast: a 385-word CNN article on Naomi Campbell testifying before a war-crimes tribunal is converted to Qglic in just 0.143 seconds, and the whole of The Importance of Being Earnest translates in about 1.3 seconds.

There are still some issues to work out, such as how I tokenize text: punctuation isn’t handled perfectly, which results in more words going untranslated. However, since I’m using the CMU database, very few words fail to make it through, and when one does, it’s most likely the result of a tokenization error.
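
To illustrate the tokenization problem, here is one way the text could be split so that punctuation never ends up glued to a word. It is only a sketch: a plain Python dict stands in for the compiled transducer, and the names LEXICON, TOKEN_RE and translate are invented for this example. Anything that is not found, including punctuation and whitespace, passes through unchanged.

```python
import re

# Hypothetical stand-in for the transducer: a tiny english -> qglic dict
# built from the sample entries shown earlier.
LEXICON = {"abscond": "abskqnd", "absence": "absuns"}

# A token is either a word (letters, with optional internal apostrophes
# such as "don't") or a run of anything else (spaces, punctuation, digits).
TOKEN_RE = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)*|[^A-Za-z]+")


def translate(text, lexicon):
    """Translate known words and copy everything else through untouched."""
    pieces = []
    for token in TOKEN_RE.findall(text):
        # Lookup is case-insensitive; unknown tokens are kept as they are.
        pieces.append(lexicon.get(token.lower(), token))
    return "".join(pieces)


if __name__ == "__main__":
    print(translate("They abscond, in absence!", LEXICON))
```

Because the two alternatives in the pattern together cover every character, joining the pieces always reproduces the original spacing and punctuation. This sketch does not restore capitalization on translated words, and it treats a typographic apostrophe (’) as punctuation rather than as part of a word.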

Another problem is that homographs (words spelled the same but pronounced differently) are not handled well yet: the first pronunciation listed is always used. This results in funny spellings when a word is both a noun and a verb (‘The farmers prodúce próduce’) and the wrong one is chosen (‘*The farmers próduce prodúce.’). Problems like these could be solved with a few more hours of work, using existing techniques to disambiguate between the two words based on sentence-sized contexts. If I get a little more time to work on this, maybe I’ll iron those problems out and put some of the larger “translated” texts online.

In the meantime, enjoy a couple of paragraphs from the article on Naomi Campbell’s court case, which have been cleaned up for the punctuation issues that I need to fix. Looking through it, I see at least one other issue. See if you can spot it, or find more! ;)

Neyomy Kambul wil testufuy in wor kruymz truyl xrzdey

(cnn) — Ey dcudc in xu wor kruymz truyl uv formr Luybiryun prezidunt Tcqrlz Teylr haz disuydid xat swprmqdul Neyomy Kambulz testumony in xu keys wil go uhed xrzdey.

Xu specul kort uv Syeru Lyon kunfrmd ti syenen wenzdey xat kambul wil teyk xu stand at xu trubywnul, dispuyt an imrdcunsy mocun xu difens fuyld mundey ti diley hr testumony.

Prqsikywtrz sey Teylr geyv Kambul ey duymund dri’g xu wor in Syeru Lyon, kqntrudikti’g Teylrz testumony xat hy nevr handuld xu precus stonz xat fywuld xu kunflikt.