Complex CSV Imports into SQL databases
Keywords: csv, SQL, XML, tab separated, data filtering, data mining, Japanese, PHP, C++
I have created many scripts for clients needing to get data (normally in CSV, tab-separated or XML format) into an SQL database (typically for use in web sites, email campaigns, machine translation, keyword suggestion, etc.). These scripts handled fixing dirty data, merge-and-purge (e.g. duplicate email addresses), multiple files in different formats and so on.
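As a rough illustration of the merge-and-purge step, here is a minimal sketch in Python (not the PHP or C++ used for the scripts described here). It assumes a CSV with `email` and `name` columns and a hypothetical `contacts` table in SQLite; the function name `import_contacts` and the very loose "contains an @" validity check are likewise assumptions made for the example.

```python
import csv
import io
import sqlite3


def import_contacts(csv_text, conn):
    """Load CSV rows into a contacts table, skipping dirty and duplicate emails.

    Returns the number of rows actually added to the table.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS contacts (email TEXT PRIMARY KEY, name TEXT)"
    )
    before = conn.execute("SELECT COUNT(*) FROM contacts").fetchone()[0]

    for row in csv.DictReader(io.StringIO(csv_text)):
        # Normalize so that " A@Example.com" and "a@example.com" compare equal.
        email = (row.get("email") or "").strip().lower()
        if "@" not in email:
            continue  # crude dirty-data filter, for illustration only
        name = (row.get("name") or "").strip()
        # The PRIMARY KEY plus OR IGNORE keeps the first row seen per address.
        conn.execute(
            "INSERT OR IGNORE INTO contacts (email, name) VALUES (?, ?)",
            (email, name),
        )

    conn.commit()
    after = conn.execute("SELECT COUNT(*) FROM contacts").fetchone()[0]
    return after - before
```

Letting the database enforce uniqueness keeps the purge logic in one place, which helps when several input files in different formats feed the same table.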
Handling different Japanese encodings, and coping with mojibake (garbled or illegal characters), is a common requirement.
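One simple way to approach the encoding problem is trial decoding, sketched below in Python. This is a heuristic under stated assumptions, not a description of the actual scripts: the candidate list, its order (UTF-8, then EUC-JP, then Shift_JIS) and the function name `decode_japanese` are all choices made for the example, and short or unusual inputs can still be misidentified.

```python
# Candidate encodings, strictest first. The order is an assumption:
# EUC-JP rejects most Shift_JIS byte sequences, so it is tried earlier.
CANDIDATE_ENCODINGS = ("utf-8", "euc_jp", "shift_jis")


def decode_japanese(raw):
    """Return (text, encoding_used) for a bytes object of unknown encoding."""
    for enc in CANDIDATE_ENCODINGS:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    # Nothing decoded cleanly: keep what can be read and mark illegal
    # bytes with U+FFFD so they can be found and reviewed later.
    return raw.decode("utf-8", errors="replace"), "utf-8 (lossy)"
```

Returning the encoding alongside the text makes it possible to log which files needed a fallback, so suspicious rows can be checked by hand rather than silently imported.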
The simpler scripts have been written in PHP; when dealing with large amounts of data, highly optimized C++ has been used. I have written scripts for everything from a handful of email addresses in a text file to multi-gigabyte XML files.
Frequently this import stage is combined with data filtering, which helps reduce very large files (such as the Wikipedia XML dumps) to manageable slices.
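The filtering idea can be sketched with a streaming parser, so the whole file never has to be held in memory. The Python example below assumes a simplified, hypothetical layout of `<page>` elements containing `<title>` and `<text>` children; real Wikipedia dumps use XML namespaces and nest the text more deeply, so the tag names here are placeholders only.

```python
import io
import xml.etree.ElementTree as ET


def filter_pages(xml_stream, keyword):
    """Yield (title, text) for each <page> whose text contains keyword."""
    # iterparse reads the stream incrementally; the "end" event fires
    # once an element and all of its children have been parsed.
    for _event, elem in ET.iterparse(xml_stream, events=("end",)):
        if elem.tag != "page":
            continue
        title = elem.findtext("title", default="")
        text = elem.findtext("text", default="")
        if keyword in text:
            yield title, text
        # Release this page's children and text before moving on.
        elem.clear()
```

A usage example on a tiny in-memory document:

```python
sample = io.BytesIO(
    b"<dump>"
    b"<page><title>Tokyo</title><text>capital of Japan</text></page>"
    b"<page><title>Paris</title><text>capital of France</text></page>"
    b"</dump>"
)
matches = list(filter_pages(sample, "Japan"))
```

Because the function is a generator, the matching rows can be passed straight to the import step as they are found.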