Last Update: 2006-09-02

Making A Dictionary From Wikipedia

Wikipedia is wonderful. It is free, it is full of fascinating information, and it can also be downloaded as XML files for data mining. Big XML files. Big is not a big enough word: massive XML files. So massive, in fact, that I did not have the disk space to unzip them. Fortunately, Unix pipes and PHP's flexible XML parser meant I did not need to: the dump can be decompressed on the fly and fed straight into the parser.
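The streaming idea can be sketched in a few lines. The example below is Python rather than the PHP of the actual script, and it is only an illustration under assumptions: the function names, the placeholder namespace and the sample data are invented here, and a real dump has more elements than the sample shows. It reads the XML incrementally, so a decompressor could pipe the dump into it without the uncompressed file ever touching the disk.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Drop the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(stream):
    """Yield (title, text) for each <page> element, one page at a time."""
    context = ET.iterparse(stream, events=("start", "end"))
    _event, root = next(context)  # the first event is the start of the root element
    title, text = None, ""
    for event, elem in context:
        if event != "end":
            continue
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            yield title, text
            title, text = None, ""
            root.clear()  # detach finished pages so memory use stays flat


# Tiny in-memory stand-in for a dump; the namespace URL is a placeholder.
SAMPLE = b"""<mediawiki xmlns="http://example.org/export">
  <page><title>Tokyo</title><revision><text>Capital of Japan.</text></revision></page>
  <page><title>Empty</title><revision><text></text></revision></page>
</mediawiki>"""

for page_title, page_text in iter_pages(io.BytesIO(SAMPLE)):
    print(page_title, len(page_text))
```

To run it against a real dump, one would pass `sys.stdin.buffer` instead of the in-memory sample and invoke something like `bzcat dump.xml.bz2 | python stream_pages.py` (both file names are hypothetical).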

Wikipedia has what are called interwiki links. These are most commonly used to link an entry in one language to the same entry in another language. I wanted to see if they would help me generate a dictionary to complement Jim Breen's jmdict, so I wrote a script to extract them. You can find the script in the latest fclib (my open-source PHP library), in the utilities subdirectory. See the comments at the top of the wikipedia_interwiki_extractor.php file for instructions on how to use it. It works with all languages, not just English and Japanese.
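As a rough illustration of what extracting interwiki links involves, here is a sketch in Python. It is not the code from wikipedia_interwiki_extractor.php. It assumes the links appear in the page text in the form [[ja:...]], that is, a language code and a colon followed by the title in that language. The regular expression is deliberately simplified, and the function names are made up for the example.

```python
import re

# Simplified pattern: "[[", a lowercase language code such as "ja" or
# "zh-min-nan", a colon, the target title, then "]]".
INTERWIKI = re.compile(r"\[\[([a-z]{2,3}(?:-[a-z]+)*):([^\]|]+)\]\]")


def extract_interwiki(wikitext, lang):
    """Return the title linked for language `lang`, or None if there is none."""
    for code, target in INTERWIKI.findall(wikitext):
        if code == lang:
            return target.strip()
    return None


def build_dictionary(pages, lang):
    """Map each page title to its counterpart title in language `lang`."""
    entries = {}
    for title, text in pages:
        target = extract_interwiki(text, lang)
        if target:
            entries[title] = target
    return entries


# Invented sample pages, as (title, wikitext) pairs.
sample_pages = [
    ("Tokyo", "Tokyo is the capital of Japan.\n[[de:Tokio]]\n[[ja:東京]]\n"),
    ("Stub", "A page with no interwiki links. [[Category:Stubs]]"),
]
print(build_dictionary(sample_pages, "ja"))
```

Pages without a link for the requested language are simply left out, so the resulting dictionary only covers entries that exist in both Wikipedias.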

It took 54 minutes to process the English Wikipedia XML file, which produced a 53,204-entry, 1.8MB file of English-Japanese translations. Generating a 63,527-entry, 2.2MB Japanese-English lookup from the Japanese Wikipedia XML file took 8 minutes.

So were they useful? Yes. Using jmdict alone I could automatically translate just over 7,500 of WordNet's entries with high or medium confidence. Using both jmdict and the interwiki dictionary files increased that to over 10,500 entries. The interwiki dictionary is also useful because it covers lots of cultural terms: movie names, manga titles, famous people, etc.

In fact it would have been even more useful, but many Wikipedia page titles contain text in brackets to disambiguate them, as in "Mercury (planet)". That extra text prevents exact matches.
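One possible way around this, which the script described above is not claimed to do, would be to strip a trailing bracketed qualifier before matching. The Python sketch below assumes the qualifier is always a final "(...)" in ASCII parentheses; the function name is hypothetical.

```python
import re

# A trailing "(...)" qualifier, together with the whitespace around it.
DISAMBIGUATION = re.compile(r"\s*\([^()]*\)\s*$")


def strip_disambiguation(title):
    """Remove a trailing bracketed qualifier; keep the title if nothing else is left."""
    stripped = DISAMBIGUATION.sub("", title)
    return stripped or title


for example in ["Mercury (planet)", "Akira (film)", "Tokyo", "(500) Days of Summer"]:
    print(example, "->", strip_disambiguation(example))
```

The trade-off is that different senses of a word collapse onto the same key once the qualifier is gone, so a match found this way probably deserves lower confidence than an exact one.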


    © Copyright 2006 Darren Cook (darren@dcook.org)