Machine Translation/Text Data Mining
Keywords: semantic network, translation, multilingual, thesaurus, data mining, Japanese, Chinese, German, English, Arabic, Wikipedia, context, XML, very large XML files, PHP, C++
I have had a strong interest in machine translation research, on and off, for over fifteen years. A combination of recent developments makes me think a dramatic jump in quality will soon be possible: fast hardware, plentiful memory, and freely available multilingual databases such as Wikipedia. The open source movement also gives me hope: imagine how quickly a system could learn if a million people checked in translation corrections every single day.
I have a working prototype that translates between English, Japanese, Chinese, German and Arabic, with a very high level of quality on the test documents. The engine is designed to be easy to adapt to a specific domain. If you are interested, please get in touch.
In 2005 I started the MLSN project, an open source multilingual semantic network. It began as a need for a Japanese thesaurus, but quickly expanded into a more ambitious project to store all the relations between words in multiple languages. 2008 saw a major upgrade:
- Chinese and German added (to the existing Japanese and English)
- Much cleaner user interface
- 50x quicker searches
Then 2009 saw Arabic added. At the current time, dictionary sizes are still quite small, so please be forgiving. Quick links to other online dictionaries in all the supported languages are provided.
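The core idea of storing relations between words across languages can be sketched as a graph of (language, word) nodes joined by typed edges. This is only an illustrative sketch, not the actual MLSN schema or code; the class and relation names are invented for the example.

```python
from collections import defaultdict

class SemanticNet:
    """Toy multilingual semantic network: nodes are (language, word)
    pairs, linked by typed relations such as "translation" or "synonym"."""

    def __init__(self):
        # (lang, word) -> relation name -> set of related (lang, word) nodes
        self.edges = defaultdict(lambda: defaultdict(set))

    def add(self, src, relation, dst):
        # Store the relation in both directions so lookups work either way.
        self.edges[src][relation].add(dst)
        self.edges[dst][relation].add(src)

    def related(self, node, relation):
        # Return related nodes in a stable (sorted) order.
        return sorted(self.edges[node][relation])

net = SemanticNet()
net.add(("en", "dog"), "translation", ("ja", "犬"))
net.add(("en", "dog"), "translation", ("de", "Hund"))
net.add(("en", "dog"), "synonym", ("en", "hound"))

print(net.related(("en", "dog"), "translation"))
# Both translations come back, regardless of which direction they were added in.
```

A real system would of course need many more relation types (hypernyms, antonyms, domain labels) and persistent storage, but the graph-of-typed-edges shape stays the same.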
My machine translation research ties in closely with my intelligent search work, my financial trading strategy work, and my go project. In a sound bite: all are about understanding context.