Japanese Site Screen Scraper
Keywords: PHP, screen scraping, POST, curl, Internet Explorer, COM, Mozilla
A project to submit information to two Japanese sites, extract the key information each returns, and combine it into a single English page.
This was complicated by the fact that the two sites used different encodings (one used Shift JIS, the other EUC). The project was written in PHP 4, using the multibyte string extension and the curl module (for POSTing the form submission).
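The encoding handling described above might look roughly like the sketch below. It is not the original project code: it uses modern PHP syntax rather than PHP 4, the helper names (`to_site_encoding`, `post_form`, `to_utf8`) are made up for illustration, and the encoding labels `SJIS` and `EUC-JP` are an assumption about what the two sites actually served. The idea is to convert form values into each site's encoding before POSTing, then normalise each response to one common encoding (UTF-8 here) before extracting anything from it.

```php
<?php
// Hypothetical helper: convert UTF-8 form values into the target site's encoding.
function to_site_encoding(array $fields, string $siteEncoding): array
{
    $converted = [];
    foreach ($fields as $name => $value) {
        $converted[$name] = mb_convert_encoding($value, $siteEncoding, 'UTF-8');
    }
    return $converted;
}

// Hypothetical helper: POST the fields with curl and return the raw response bytes.
function post_form(string $url, array $fields): string
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_POST, true);
    curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query($fields));
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
    $body = curl_exec($ch);
    if ($body === false) {
        throw new RuntimeException('POST failed: ' . curl_error($ch));
    }
    return $body;
}

// Hypothetical helper: normalise a response from a site to UTF-8.
function to_utf8(string $raw, string $siteEncoding): string
{
    return mb_convert_encoding($raw, 'UTF-8', $siteEncoding);
}

// Offline demonstration: the same text as each site would send it.
$sample  = '東京';
$asSjis  = mb_convert_encoding($sample, 'SJIS', 'UTF-8');
$asEuc   = mb_convert_encoding($sample, 'EUC-JP', 'UTF-8');

// Once both are normalised, they can be compared or combined safely.
var_dump(to_utf8($asSjis, 'SJIS') === to_utf8($asEuc, 'EUC-JP'));
```

In real use, a call such as `to_utf8(post_form($url, to_site_encoding($fields, 'SJIS')), 'SJIS')` would be made once per site, with `$url` and `$fields` standing in for that site's form address and inputs.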
Thoughts: If I were doing this today, and the Windows platform were an option, I would use PHP's COM extension to control Internet Explorer (see the September 05 issue of PHP Architect). I am also investigating similar control of Mozilla but have yet to find anything easy. Chickenfoot and Greasemonkey look interesting, but they are browser plug-ins and so require manual intervention to operate.