This page lists a sample of projects I've worked on in the past few years. If you have questions about any of these, please don't hesitate to get in touch.
Financial Data Handling
Keywords: finance, PHP, R, C++, real-time prices, historic data, data mining, mining, financial news
I have a lot of experience downloading, importing, mining, validating and repairing financial data (from ticks to daily bars). Everything from FX to stocks to indices to derivatives, and dealing with a range of data suppliers in various countries.
I generally use PHP for data manipulation and download tasks and, increasingly, R for data mining tasks. I use C++ when I need proprietary libraries that are only available in C/C++. (I also use C++ for projects where execution speed or memory use is critical.)
See Also: Financial Charting, Trading Strategies.
Trading Strategies
Keywords: finance, R, statistics, technical analysis, real-time prices, historic data, data mining, mining, high frequency
This is a set of related projects applying ideas from data mining, text mining and technical analysis to automated trading strategies in stocks (Japan, U.K., U.S.), futures (Japan, U.S.) and FX.
R is the main language for this project, with support from C++ and some PHP scripts.
It is looking very promising so far, and I'm already using some of the algorithms with my own money. I will be announcing results after more extensive backtesting; if you would like to be notified of progress, please send me an email.
Web Mining/Social Mining (Facebook, Twitter, mixi, etc.)
Keywords: data mining, social media, data scientist, data analyst, web mining, screenscraping, scraping, mining, twitter, facebook, mixi, sentiment, Firefox, Internet Explorer, Selenium, PHP, R
Mining and scraping web sites (both those that provide an API and those that make it hard), in various world languages. Also text mining social network sites: for instance, measuring sentiment based on a large number of Twitter tweets (again, handling more than just English), and doing the same for various news web sites in a large number of world languages (English, Japanese, Chinese, German, French, Russian, Korean, Hindi and more).
Automation tasks, such as posting the same announcement to Facebook, Twitter and a blog, saving a user repetitive and boring work.
(I've also made some minor contributions to the open source Phirehose project, which is used for getting streaming data from Twitter, and have contributed some demos for the twitteroauth library.)
Machine Translation/Text Data Mining
Keywords: semantic network, translation, multilingual, thesaurus, data mining, Japanese, Chinese, German, English, Arabic, Wikipedia, context, xml, very large xml files, PHP, C++
I have had a strong interest in machine translation research, on and off, for over fifteen years. A combination of recent developments makes me think a dramatic jump in quality will soon be possible: fast hardware, lots of memory and free multilingual databases such as Wikipedia. The open source movement also gives me hope: imagine how quickly a system could learn if a million people checked in translation corrections every single day.
I have a working prototype doing translations between English, Japanese, Chinese, German and Arabic, with a very high level of quality on the test documents. The engine is designed to be easy to adapt to a specific domain. If you are interested please get in touch.
In 2005 I started the MLSN project, an open source multi-lingual semantic network. It began as a Japanese thesaurus, but quickly expanded into a more ambitious project to store all the relations between words in multiple languages. 2008 saw a major upgrade:
- Chinese and German added (to the existing Japanese and English)
- Much cleaner user-interface
- 50x quicker searches
2009 then saw Arabic added. At the current time, dictionary sizes are still quite small, so please be forgiving. Quick links to other online dictionaries in all the supported languages are provided.
My machine translation research ties in closely with my intelligent search work, financial trading strategy work and with my go project. In a sound bite: all are about understanding context.
Cloud Computing, AWS, Parallel Programming
Keywords: cluster, cloud, cloud programming, thread programming, performance, scaling, AWS, Amazon Web Services, EC2, high performance computing
I have been writing threaded, parallel programs for many years, firstly (and more importantly) for responsiveness (UI handling, listening on a socket, etc.) and secondly to get more speed out of modern multi-core machines. I have also developed systems that work over a cluster to scale processing capacity (and also for improved reliability).
I have experience using Amazon Web Services, and understand the APIs for controlling this from PHP. AWS is good value for handling uneven load, in particular when the peak load cannot be handled on a single high-spec machine, but if your website or application is 24/7 with fairly consistent load then dedicated servers are better value (e.g. one quarter of the per-month cost, at mid-2011 prices). For certain load patterns I have found Rackspace gives better value for money, though I prefer Amazon's use of ECUs to make it clear how much CPU you get.
Intelligent Search/Keyword Suggestion/Multilingual Search Engine
Keywords: semantic network, search engine, Namazu, Chasen, keyword, advertising, synonyms, quick, intelligent, spam classification, open source intelligence, SQL, Japanese, Chinese, German, English, Arabic, data scientist, data analyst
Intelligent search is a key interest of mine; it ties in closely with my interests in machine translation and computer go, and artificial intelligence generally.
Most current search technology relies on an exact match: as a simple example, if you search for "woodland" you do not get pages about "forests". MLSN, in conjunction with other technology I have developed, can be used to improve search engine results by discovering the meaning behind the words: both the words in the search query, and the words on the pages being indexed.
The same techniques that apply to searching the web, and searching within a web site, also apply to realms as diverse as spam detection, open source intelligence, advertiser keyword suggestion, and data validation.
My work allows a high level of intelligence and usefulness not just in English, but (currently) also in Japanese, Chinese, German and Arabic. If you would like to learn more please get in touch.
I have used these intelligent search techniques, the open source MLSN database, Namazu (an open source search engine tailored for Japanese), chasen (a Japanese morphological analyzer) and other tools and data sources, in a number of commercial projects, including some for very large sites, and including dynamic database-backed web sites.
Nowadays I am a big fan of jQuery. Not only has it simplified the syntax, but it also almost completely removed the need to think about browser portability issues. I've written a few small jQuery plugins and naturally they are released under a liberal open source license.
Of course I am familiar with HTML5 and CSS (including CSS3). I have also been working on my design skills the past few years, and my photoshopping skills, but I make no claims to be a professional graphic designer.
Flash Charting
Keywords: flash, swf, charts, financial charts, PHP, server/client, statistics, dcflash, actionscript, xml sockets, marketing, data visualization
I have an actionscript charting library (called "dcflash"). You can see some demos of it. (I have some more sophisticated demos not released publicly. If you would like to see them please contact me directly.) This library has generated a lot of interest and parts are gradually being released as open source: late 2006 saw a first release on sourceforge: http://dcflash.sf.net/.
Most applications so far have been in financial charting (described in its own section). The charts have also been used in marketing applications (see the pie chart demo) and in data visualization (try the "graph" links in the MLSN results page).
Not charts exactly, but I have also generated exhibition maps (over 500 maps: one for each exhibitor, highlighting their booth), using data from an SQL database. For each exhibitor both an interactive Flash version and a static GIF version were generated. The maps were re-generated automatically each time the database was updated.
For connections to the back-end live data source I have typically used XML sockets, which I have found efficient and flexible. Incidentally, my fclib library contains a free PHP XML socket library that works on Linux, Windows and Mac, and has been useful for a number of quite diverse projects.
HTML5 looks quite promising as a replacement for Flash for this kind of charting. However, it is less efficient (more CPU cycles, more memory), so as the amount of data grows the charts can end up more sluggish than the same chart written in Flash.
Financial Charting
Keywords: flash, swf, charts, finance, financial charts, PHP, server/client, real-time prices, historic data, statistics, technical analysis, dcflash, actionscript, xml sockets
I have worked on a number of financial charting projects, using my flash charting library (see separate entry, and some demos of it).
As one example, I developed a sophisticated charting tool for a financial industry start-up. The chart styles included candlestick, bar, area, line, point and figure, and market profile. Analysis tools included RSI, stochastics, Bollinger bands and volume. The Flash application was interactive: users could draw trendlines, Gann lines and Fibonacci lines. A live news feed was also included.
In addition to the front-end I wrote the PHP back-end server for this same project. It had the following primary tasks:
- Handle requests from hundreds of simultaneous clients, getting data from the database (Microsoft SQL Server), pre-processing results, then sending the data to the client. Flash XML sockets were used for communication.
- Receive real-time data from four different exchanges and simultaneously update both the historic data and push the latest data to clients that are watching those symbols.
- Import huge quantities of historical data, from raw tick data.
The entire system was very portable. The back-end ran on either Windows using Microsoft SQL Server, or on Linux using MySQL. The front-end ran on all Flash 6 platforms: Windows, Mac, Linux.
I find PHP the most suitable language for most projects, the R language best for data mining and machine learning (including financial data analysis), and C++ for larger, more complex projects or where speed or memory usage is an issue. I have used C++'s Standard Template Library (STL) extensively, and am familiar with C++11 and many of the Boost libraries.
Wherever possible I write code that will compile and run on both UNIX and Windows.
Most of my SQL experience has been with MySQL, PostgreSQL, SQLite and Microsoft SQL Server; I am aware of the differences between these and other databases such as Oracle and DB2. Wherever possible I write vendor-neutral SQL.
(For human languages: English is my native tongue, my Japanese is strong (I passed the Japanese Language Proficiency Test 1-kyu in summer 2009), and I understand basic Chinese, German and Arabic; in each case my reading/writing is notably better than my speaking/listening.)
Real-Time Credit Card Processing
Keywords: ecommerce, credit card, Veritrans, billing, encryption, gpg
I have worked on a couple of sites that do live credit card billing, meaning the card is authorized and billed automatically and the customer given a success or failure message immediately. For both sites the credit card processing company was Veritrans. My open source fclib library includes functions to assist in interfacing PHP with Veritrans's perl MDK.
For one client, a major drinks company, I created an online order form with real-time credit card verification and charging. A daibiki (cash-on-delivery) option was also offered.
For the other client, a telecommunications company, the needs were more complex. In addition to the initial billing, a credit authorization is taken and then used to bill the customer for usage at regular intervals. The credit card number is also recorded in the database; to keep this secure, the data is encrypted with public-key encryption, using gpg. gpg (or PGP) is also used to encrypt customer information when sending order emails, ensuring privacy.
Computer Go
Keywords: igo, baduk, weiqi, search, trees, games, AI, patterns, hashing, indexing, machine learning, MCTS, UCT, alpha-beta, 9x9, shodan go bet, monte carlo, algorithms, data scientist, data analyst
This is a long-term project (started in 1993 and still far from finished!) to make a program that can play the game of go at the level of a very strong human player. The level of computer go has improved dramatically in the past five years, but there is still much potential for improvement.
This project is a test-bed for my Artificial Intelligence research, in particular data mining, pattern recognition, two-dimensional pattern hashing/indexing and various search algorithms. The necessity for understanding context (in order to make useful patterns) also gives it much in common with my other AI interests of machine translation, automated financial trading and intelligent search.
For the past eight or nine years I have been working on a specific subset of the computer go problem: 9x9. My research started with trying to solve endgames on 9x9 boards (a "right-sized" challenge, as it requires strong life-and-death reading), then moved into using programs together in a team. I presented a peer-reviewed paper entitled A Human-Computer Team Experiment for 9x9 Go at the Computer Games 2010 conference. My research is still continuing, with the eventual aim of being able to state the correct komi for 9x9 go with strong confidence (implying being able to produce a near-perfect opening book).
In 1997 I made a bet with John Tromp that a computer would be able to beat him by the end of 2010, which resulted in the Shodan Go Bet event held in London in December 2010. I ended up losing the bet, and one thousand dollars, but it was good fun and educational, and with hindsight it looks like John was cutting it fine: the result might have gone the other way had the event taken place just 12 months later.
As computer go, for me, is mainly a test-bed for more real-world problems, I have quite a few half-finished papers that I never seem able to justify the time to finalize. But a few bits and pieces have been published and can be found at Darren's Computer Go Pages.
Open Source Software
I am a keen supporter of open source software, and make small contributions to numerous projects. I also maintain the following projects:
- FCLIB: a mixed bag of PHP classes and functions
- MLSN: Multi-Lingual Semantic Network (or more modestly, a Japanese thesaurus)
- dcflash: An actionscript utility library, with focus on chart drawing and statistical analysis
I always use an MIT or BSD license for my open source projects, believing in the importance of freedom and the natural willingness of users to submit improvements, without the strong-arming of the GPL "virus" and similar licenses. When open source first became popular I did not really care, but over many years of having to reinvent libraries that were unusable in a project solely because they were GPL, I have become more and more convinced of the importance of maximum freedom.
Server Administration
Keywords: web server, cluster, load balancer, mail server, SQL database, database replication, router, firewall, high availability, cisco, linux, redhat, fedora, Windows, Sun, BSD
I have set up and maintained numerous Linux servers, running busy production web sites, as well as DNS, database, mail and other servers, usually running some version of Red Hat, Fedora Core, Ubuntu or Debian. I have also worked with Sun and BSD machines in the past, and maintained a Cisco 2600 router/firewall for a couple of years (many years ago). In addition I have set up Linux machines as firewalls and routers when the budget didn't stretch to dedicated hardware. In the distant past I have worked with Windows web servers.
I have set up LAN monitoring and alerts (to email and mobile phone), and have optimized machines and server software for large loads. I have a solid understanding of load balancers, server clusters and database replication. I have studied for the Cisco exams (though not actually taken them), and have a good understanding of router setup and routing protocols.
Note: I am no longer actively chasing server administration work, preferring to focus on software design and coding. My intention here is merely to show I have practical experience with the internet at all levels.
Mobile App/Website Development
Keywords: mobile, iphone, ios, android, apps, mobile apps, keitai, Japanese mobile
I have worked on a range of mobile projects, over a number of years, primarily for the Japanese market. I am familiar with making apps for iOS and Android phones, and up to speed with jQuery Mobile. I have also worked on mailing list distribution to mobile phones, and have a deep understanding of many of the issues involved.
At the risk of showing my age, I have also worked on Brew and Flash Lite (Flash Lite 1 and Flash Lite 2) projects.
PHP Open Source Library (FCLIB)
Keywords: PHP, Japanese, UTF8, web forms, CMS, Veritrans, SQL, XML, csv
This is an open-source library of PHP functions that I've developed over a number of years. The first versions were based on library code I brought with me when I joined FlyingColor (as CTO, in February 2000), and it has expanded from there. It is not organized around any particular functionality and is more a mixed bag of functions that have been useful on numerous projects. It is fairly well documented: all documentation is in the source files, Javadoc style.
It can be downloaded at http://dcook.org/software/fclib/.
It is a very casual open source project: I do not actively promote it or try to attract other programmers, though of course I am open to receiving patches.
Web Site/Flash Activity Tracking
Keywords: tracking, demographics, html, web, flash, analysis, data mining, user experience, marketing, site optimization
For one web site I developed a system that tracks the user from entry into the site, then through its various pages. It ties this information to the demographic information in the database when the user is known; it does not require a login, however, and can track complete sessions for users who log in halfway through the session or do not log in at all. It uses PHP sessions, which means it works transparently whether the user has cookies enabled or disabled, and also works for mobile phones.
The second half of the system then imports the data regularly and automatically produces daily, weekly and monthly reports on user activity.
Web analysis software such as WebTrends claims to do similar analysis from web server logs, but this system is both more accurate (WebTrends, for instance, uses IP addresses and has to guess when a session starts and ends) and can tie the information to a back-end user database (e.g. allowing us to compare the most common paths through the site for users of different genders or age groups).
I have also developed a Flash movie that tracked the user's actions and reported them back to the server, in a format that integrates with this same web site user tracking system. This was done as a proof-of-concept, and was not used on any real sites.
Note: this kind of detailed tracking system tends to be overkill: most clients are just not prepared to act on the in-depth analysis it can provide.
Keywords: CMS, content management system, php, zend, doctrine, XML, Smarty, Pear, Quickform, UTF8, i18n
I have developed a large number of PHP-based websites, SNSes and CMSes for various clients. In most cases an off-the-shelf package would not have been flexible enough for the requirements. Support for multiple Asian languages - not just Japanese - has usually been a requirement.
They are usually based around an SQL database (typically MySQL, SQLite or PostgreSQL), though in one case the data was both read from and written to a very complex XML file (the same file was used for the print version of the data: a prestigious scientific journal). CSV is used as a database format for certain applications. I find phpMyAdmin makes an adequate admin interface to an SQL database for new projects, but I've also written many customized database interfaces for administration staff.
Zend Framework and Doctrine (Doctrine 1) were used as the basis for one website. I like the abstraction Doctrine brings, though I have some concerns about performance. In contrast, I found Zend Framework introduces as much complexity as it abstracts, and have been mostly unhappy with it.
Smarty was used as the template engine for some (older) projects. I have found this easy to use, powerful and easy to extend. Designer resistance seemed common, but no better alternative was ever suggested. I have used Pear::QuickForm on a couple of projects, and similar to Zend, found it introduced as much complexity as it saved for non-trivial projects.
I have used gpg and PGP to encrypt data that is sent out, and have a solid understanding of public key encryption, as well as general security issues for web development (SQL injection, XSS, password hashing vs. encrypting, etc.).
Flash Video Kiosk
Keywords: Flash 8, Actionscript 2, XML, embedded fonts, projector, Screenweaver, mtasc, swfmill, video, flash memory leak workaround, socket server, php, legacy application, SJIS, UTF8
I have created a very flexible video kiosk for one client. It was tied to a music playlist coming from a legacy application. On the server side I wrote a socket server (in PHP) to interface with this legacy application and then be an XML socket server for the front-end clients, which used Flash. Part of the job of the back-end was to convert the Japanese text from Shift-JIS to UTF-8 encoding. Both back and front ends ran on Windows.
On the same project I worked on the actionscript for the front-end flash client. This was primarily for handling parsing of data files and screen transitions, driven by the playlist coming from the back-end. Functionality included text handling in English, Japanese and Chinese; full-screen video; embedded animations; screen transitions; etc.
I used Actionscript 2 and then Actionscript 3 for this project, using mtasc and swfmill.
I also used Screenweaver, a tool to turn a swf file into a Windows exe; it is more useful than the projector that Flash can make, as you can add custom actionscript commands to control window size and visibility. The primary use of Screenweaver in this project was to allow the Flash window to be invisible initially while the fonts loaded and it connected, and to position itself on a second monitor (a wide-screen TV).
Complex CSV Imports into SQL databases
Keywords: csv, SQL, XML, tab separated, data filtering, data mining, Japanese, PHP, C++
I have created many scripts for clients needing to get data (normally in CSV, tab-separated or XML format) into an SQL database (typically for use in web sites, email campaigns, machine translation, keyword suggestion, etc.). These scripts handled fixing dirty data, merge-and-purge (e.g. duplicate email addresses), multiple files in different formats and so on.
Handling different Japanese encodings, and coping with mojibake (garbled characters), is a common requirement.
The simpler scripts have been written in PHP; when dealing with large amounts of data, highly optimized C++ has been used. I have written scripts for everything from a handful of email addresses in a text file to processing multi-gigabyte XML files.
Frequently this import stage is combined with data filtering, helping to reduce very large files (such as the Wikipedia xml dumps) to manageable slices.
Mail Broadcast System
Keywords: email, HTML, mobile, keitai, tracking, demographic analysis, data mining, marketing, PHP, Java
Over ten years ago, I designed and was lead programmer on a system that took demographic information about a user from a back-end database and then created and sent a custom email (HTML or text; mobile phone email was also supported). At the time it was among the most sophisticated systems of its kind. It was mostly used to send Japanese emails, and naturally coped with the complexities of Japanese email and various character set standards.
It tracked email bounces, links (and views in the case of HTML), and produced reports cross-referencing demographics with link clicks and other demographics. It allowed the creation of custom reports, data screens and campaign targeting based on not just demographics but also user activity in previous campaigns. It was commonly integrated with the back-end databases for web sites.
The system was called PEP and was developed for FlyingColor Group. Most of the development was in PHP, but there was also some work in Java. The system was used for a large number of campaigns, promotions and newsletters over a period of 7-8 years.