|Last Update: 2008-12-23|
Adding More Languages
At the time of writing MLSN has five languages (English, Japanese, Chinese, German and Arabic), all of which were set up by me (Darren Cook). People have asked me what is required to add a new language, and this page is intended to answer that question.
For me the biggest obstacle to adding more languages is developing some reading and writing skill in each language and getting up to speed on its computing issues. This is fun, but requires more time than I have, so I am hoping that, for each new language, one or more people will volunteer to do the initial setup and then take responsibility for that language. A volunteer needs:
- Some reading/writing skill in the language (being a native speaker is nice, but not essential).
- Some experience in data processing in the language, and good knowledge of challenges peculiar to it.
- Awareness and appreciation of:
- semantic networks
- subtle differences in word meanings
- available dictionary and other resources.
- Strong communication skills in English, as that is the common language of the MLSN project.
- Understanding of what open source is, and of the differences between public domain, BSD/MIT licenses (like MLSN), the GPL, and the various Creative Commons licenses.
This section outlines the steps to set things up. Please ask about anything that is unclear, and I will explain in more detail (and then update this document!).
- All data sources should be written in UTF-8.
- If there is an existing semantic network for this language (and its license is compatible), write a parser for it (for instance, to produce SQL that can be run to import the data quickly).
- An en-xx (and/or xx-en) dictionary, where xx represents the new language and en means English. Write a parser for it.
- Get the latest Wikipedia dump and run the script to make a dictionary from it.
- Run the script to take the various dictionary sources and produce high/medium/low confidence translations of the English MLSN table.
- Do the imports (and run the make_index script after doing so).
- External dictionary sources that can be linked to: write the PHP code (in code/main_code.inc).
- The basic search uses exact match. Add any code needed for more intelligent search and for partial search. (At the time of writing this is not very modular, and partial search is not available at all.)
- Attribution: add entries to dict.php and main.php to describe the key used, and the original location of the data sources.
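The dictionary-parsing steps above can be sketched roughly as follows. This is a minimal illustration, not MLSN's actual import code: the input format (tab-separated "english TAB translation" lines, in UTF-8 as required above) and the table and column names (`words_xx`, `en`, `xx`) are assumptions to be matched against the real schema.

```python
def dict_to_sql(lines, table="words_xx"):
    """Turn tab-separated "english<TAB>translation" lines into SQL INSERTs.

    The table and column names here are hypothetical; check them against
    the actual MLSN schema before running the output against the database.
    """
    stmts = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        en, xx = line.split("\t", 1)
        # double any single quotes so the generated SQL stays valid
        en = en.replace("'", "''")
        xx = xx.replace("'", "''")
        stmts.append("INSERT INTO %s (en, xx) VALUES ('%s', '%s');"
                     % (table, en, xx))
    return stmts
```

Writing SQL to a file and importing it in one batch is much faster than issuing one query per entry from a script, which is why the steps above suggest generating SQL rather than inserting directly.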
Decide the data standards, and what will go in the square brackets. (In the future these may be replaced by a more general way to add all kinds of tags to words, but for the moment all attributes go in the square brackets.)
The basic rule: include as much primitive information as possible. Attributes can always be hidden, but it is hard to add information later. However, anything that can be created perfectly by an algorithm should not be included; instead it should be generated as part of the user interface.
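Whatever goes in the square brackets, splitting an entry into the headword and its attributes is straightforward. A sketch, assuming the format is always "word [attr] [attr] ..." as in the per-language examples below:

```python
import re

def split_attrs(entry):
    """Split "word [attr] [attr] ..." into the word and its attribute list.

    Assumes attributes always come in square brackets after the headword,
    e.g. a German noun with "[n]" or a Chinese word with its pinyin.
    """
    attrs = re.findall(r"\[([^\]]*)\]", entry)
    word = re.sub(r"\s*\[[^\]]*\]", "", entry).strip()
    return word, attrs
```

Keeping the attribute syntax this regular is what makes "attributes can always be hidden" cheap: the interface can strip or show them without any per-language logic.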
Examples for Arabic
- We include all diacritics; they are easy to strip out, but cannot be added automatically.
- The Arabic root can be guessed, but not perfectly, so it is included, in the square brackets.
- Buckwalter transliteration is a perfect 1-1 representation of the Arabic, so it is not part of the database.
- Other transliterations can be created algorithmically from the Arabic, but they are lossy: the Arabic cannot be created reliably from the transliteration. However, as we store the Arabic, the lossy conversion can be done on the fly as part of the user interface, so again there is no need to include it. (Place names and famous people are exceptions, as there is often an inconsistent transliteration in general use; for the moment we assume the link to the English version of MLSN will give us those, and in the Arabic MLSN we stick with the logical, consistent transliteration.)
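The "easy to strip out" claim for diacritics can be demonstrated in a few lines. The harakat and other marks are Unicode combining characters, so filtering on the combining class removes them while leaving the base letters intact:

```python
import unicodedata

def strip_diacritics(arabic):
    """Remove harakat (short-vowel marks, shadda, sukun, etc.).

    Combining marks have a non-zero Unicode combining class, so
    filtering on unicodedata.combining() strips them while keeping
    the base consonants and long vowels intact.
    """
    return "".join(ch for ch in arabic
                   if not unicodedata.combining(ch))
```

This is exactly the direction that works: going from vowelled to unvowelled text is mechanical, while restoring the diacritics is not, which is why the vowelled form is what gets stored.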
Examples for Japanese
- The same word can be written in hiragana, katakana and kanji. We use the (vague) rule that any form in common use is included. If in doubt it is probably best to include them all; a program could later count usages against a large corpus, for instance, and strip out uncommon forms.
- The katakana reading is always given in square brackets, even if the word itself is katakana. Inside the square brackets the long vowel symbol is never used; the explicit vowel character is used instead.
- Hepburn romaji can be generated algorithmically, so is not included.
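Replacing the long vowel symbol with an explicit vowel character can be automated. A sketch of one possible reading of the rule, where the substituted vowel is taken from the preceding kana (so コー becomes コオ); which vowel the real database uses should be checked against actual entries:

```python
import unicodedata

# the five explicit vowel katakana, keyed by vowel letter
VOWEL_KANA = {"A": "ア", "I": "イ", "U": "ウ", "E": "エ", "O": "オ"}

def expand_long_vowels(katakana):
    """Replace the long vowel mark (U+30FC) with an explicit vowel kana.

    The vowel is read off the Unicode name of the preceding character
    (e.g. KATAKANA LETTER KO ends in O, so the mark becomes オ).
    This is an illustrative assumption, not MLSN's actual script.
    """
    out = []
    for ch in katakana:
        if ch == "ー" and out:
            # last vowel letter in the previous kana's Unicode name
            name = unicodedata.name(out[-1])
            vowel = [c for c in name if c in "AIUEO"][-1]
            out.append(VOWEL_KANA[vowel])
        else:
            out.append(ch)
    return "".join(out)
```

Since this direction is mechanical but the reverse (knowing when to re-introduce the long vowel mark) is not, storing the explicit-vowel form in the brackets loses nothing.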
Examples for German
- The gender is written after the word in square brackets: [m], [f] or [n].
- Should nouns be capitalized or not? (Currently the database is a mixed mess of both.)
- For words that changed spelling in the spelling reforms, include both versions in the database. (For the moment, note which is which in the comments field.)
Examples for Chinese
- Chinese means Simplified Mandarin. For the moment traditional equivalents can be given in the comments field. Later, a separate MLSN database will be used for traditional Mandarin, and another for Cantonese.
- Pinyin is written in the square brackets.
- Numbers are used in the pinyin instead of diacritics, for easier input. For the neutral tone the number 5 is used, and there are no spaces between syllables: so "ma1ma5", not "ma1 ma5", "ma1ma" or "ma1 ma".
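Normalizing incoming pinyin to this convention can be sketched as below. It splits on letter-runs optionally followed by a tone digit and treats a syllable with no digit as neutral tone; note it cannot split runs like "mama" that carry no tone digits at all, so those would need manual review:

```python
import re

def normalize_pinyin(pinyin):
    """Normalize pinyin to the MLSN convention: a tone number on every
    syllable (5 for the neutral tone) and no spaces between syllables.

    Limitation: a run of letters with no tone digits ("mama") cannot be
    split into syllables here and is treated as one neutral-tone unit.
    """
    syllables = re.findall(r"[a-zü]+[1-5]?", pinyin.lower())
    return "".join(s if s[-1].isdigit() else s + "5"
                   for s in syllables)
```

Usage: `normalize_pinyin("ma1 ma")` and `normalize_pinyin("ma1ma")` both yield the canonical "ma1ma5".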