Darren Cook <darren@dcook.org>


Japanese Site Screen Scraper

Keywords: PHP, screen scraping, POST, curl, Internet Explorer, COM, Mozilla

A project to submit information to two Japanese sites, extract the key information each returns, and combine the results into a single English page.

This was complicated by the fact that the two sites used different encodings (one used Shift JIS, the other EUC-JP). The project was written in PHP4, with multibyte extensions and the curl module (for POSTing the form submissions).
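To illustrate the encoding problem, here is a minimal sketch of the idea. It is in Python for brevity and is not the original PHP4 code. The site names, the field name `q`, and both helper functions are hypothetical. The assumption is that each site expects its form data in its own charset and replies in that same charset.

```python
from urllib.parse import urlencode

# Hypothetical mapping: the real sites are not named in this write-up.
SITE_ENCODINGS = {
    "site_a": "shift_jis",
    "site_b": "euc_jp",
}


def build_post_body(site, fields):
    """Percent-encode form fields using the charset the target site expects."""
    encoded = urlencode(fields, encoding=SITE_ENCODINGS[site])
    # urlencode returns ASCII-only text; POST bodies are sent as bytes.
    return encoded.encode("ascii")


def decode_response(site, raw_bytes):
    """Turn a site's raw response bytes into a Unicode string."""
    # errors="replace" keeps going if the page contains stray bad bytes.
    return raw_bytes.decode(SITE_ENCODINGS[site], errors="replace")


if __name__ == "__main__":
    query = {"q": "東京"}  # hypothetical form field
    for site in SITE_ENCODINGS:
        print(site, build_post_body(site, query))
```

The same query produces different bytes for each site, but once both responses are decoded to Unicode they can be searched and merged with the same code, which is the step that makes a single combined page practical.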

Thoughts: If I were doing this today, and a Windows platform were an option, I would use PHP's COM extension to control Internet Explorer (see the September 05 issue of PHP Architect). I am also investigating similar control of Mozilla but have yet to find an easy way to do it. Chickenfoot and Greasemonkey look interesting, but they are browser plug-ins and so require manual intervention to operate.


Last updated: 9th Dec 2006, © Copyright Darren Cook, 2002-2006.
