Posted by tomwhite
on July 20, 2007 at 1:10 AM PDT
How to run data processing applications on a rented grid.
In the past I've raved about the MapReduce parallel programming model, Apache Hadoop (the framework for running MapReduce applications), and Amazon's compute and storage web services (EC2 and S3). Now I've written an article, Running Hadoop MapReduce on Amazon EC2 and Amazon S3, about using them all together to do some data crunching.
The nice thing is that you can fire up a fair-sized Hadoop cluster (20 nodes is the current limit on EC2) in minutes and run it for just as long as your job needs; you pay by the hour. EC2 is still in limited beta and has had long waiting lists, but the backlog was recently cleared, so if you're interested in trying it out, now might be a good time.
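To get a feel for what paying by the hour means for a short-lived cluster, here is a rough back-of-the-envelope sketch. The function name, the hourly rate, and the rule that partial hours are rounded up are all assumptions made for illustration, so check Amazon's current pricing before relying on the numbers.

```python
import math


def cluster_cost(nodes, job_hours, hourly_rate):
    """Estimate the cost of renting a cluster for one job.

    Assumes every node is billed for each started hour (partial hours
    rounded up), which is an assumption, not a quoted billing rule.
    """
    billed_hours = math.ceil(job_hours)
    return nodes * billed_hours * hourly_rate


# Hypothetical inputs: a 20-node cluster (the limit mentioned above),
# a 2.5-hour job, and a placeholder rate in dollars per node per hour.
cost = cluster_cost(nodes=20, job_hours=2.5, hourly_rate=0.10)
print(f"Estimated cost: ${cost:.2f}")
```

Because the estimate scales with nodes times billed hours, a job that parallelizes well costs roughly the same whether it runs on a few nodes for a long time or on many nodes for a short time.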