MapReduce is a programming model for processing vast amounts of data. One reason it works so well is that it exploits a sweet spot of modern disk drive technology trends. In essence, MapReduce works by repeatedly sorting and merging data that is streamed to and from disk at the transfer rate of the disk. Contrast this to accessing data from a relational database that operates at...
on Mar 18, 2008
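The dataflow the post describes — map emits key/value pairs sequentially, the framework sorts them, and reduce merges runs of equal keys — can be sketched in miniature. This is an illustrative word count, not Hadoop's actual implementation; in a real system the sort and merge happen externally, streaming to and from disk:

```python
from itertools import groupby
from operator import itemgetter

def map_phase(lines):
    """Map: emit a (word, 1) pair for every word, as a sequential stream."""
    for line in lines:
        for word in line.split():
            yield (word, 1)

def reduce_phase(pairs):
    """Sort, then merge: group runs of equal keys and sum their counts.
    The grouping is a single sequential pass over the sorted stream."""
    for word, group in groupby(sorted(pairs, key=itemgetter(0)), key=itemgetter(0)):
        yield (word, sum(count for _, count in group))

lines = ["the quick brown fox", "the lazy dog", "the fox"]
counts = dict(reduce_phase(map_phase(lines)))
```

Because every phase only ever reads and writes sequentially, a framework can spill the intermediate pairs to disk and still run at the drive's transfer rate rather than its seek rate.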
I've bumped into consistent hashing a couple of times lately. The paper that introduced the idea (Consistent Hashing and Random Trees: Distributed Caching Protocols for Relieving Hot Spots on the World Wide Web by David Karger et al) appeared ten years ago, although recently it seems the idea has quietly been finding its way into more and more services, from Amazon's Dynamo to memcached (...
on Nov 27, 2007
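The core trick from the Karger et al paper is small enough to sketch: hash both nodes and keys onto the same ring, and assign each key to the first node clockwise from it. A minimal sketch (no virtual nodes, which real systems add to even out load; names are made up):

```python
import bisect
import hashlib

def _hash(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class HashRing:
    """Minimal consistent hash ring: nodes and keys share one hash space."""
    def __init__(self, nodes):
        self._ring = sorted((_hash(n), n) for n in nodes)

    def node_for(self, key: str) -> str:
        h = _hash(key)
        # First node clockwise from the key's position; wrap at the end.
        i = bisect.bisect(self._ring, (h, ""))
        return self._ring[i % len(self._ring)][1]

ring = HashRing(["cache-a", "cache-b", "cache-c"])
before = {k: ring.node_for(k) for k in ("alpha", "beta", "gamma", "delta")}

bigger = HashRing(["cache-a", "cache-b", "cache-c", "cache-d"])
moved = [k for k in before if bigger.node_for(k) != before[k]]
```

The payoff is in `moved`: when a node joins, only the keys falling in the new node's arc change owner, instead of nearly everything being reshuffled as with `hash(key) % n`.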
I've raved about the MapReduce parallel programming model in the past, and Apache Hadoop (the framework for running MapReduce applications), and Amazon's compute and storage web services (EC2 and S3). Now I've written an article - Running Hadoop MapReduce on Amazon EC2 and Amazon S3 - about using them all together to do some data crunching.

The nice thing is that you can fire up a fair-sized...
on Jul 20, 2007
In March I wrote of affordable web-scale computing:
I would love an API that exposes Google's MapReduce, a simple programming model for crunching on large datasets. You can write and run MapReduce programs today, using Hadoop, but it's only really useful if you have enough machines at your disposal. The pay-as-you-go model of S3 (and Sun Grid) would be very attractive to developers who want...
on Aug 24, 2006
In case you haven't heard of it, Amazon S3 is a web service for storing data.
The two great things about it are that it's simple (look at its nice REST API), and it's cheap (with a pay-as-you-go charging model).
This latter point explains the growing number of startups that are using it to launch new business ventures: no data silos to maintain, and pay by the gigabyte.
My favourite innovative...
on Aug 13, 2006
With the launch of Amazon S3 (Simple Storage Service) we are seeing a continuation of the trend for the big web companies to monetize their computing infrastructure by opening it up to developers.
It is probably only a matter of time before we see Google create something similar, which would essentially be a limited public interface onto the Google File System.
I would love an API that exposes...
on Mar 17, 2006
In a previous blog post I wrote about Nutch's MapReduce implementation, for distributed processing of massive data sets. This, and the closely related Nutch Distributed File System (renamed the Hadoop Distributed File System), have now been moved into a standalone project called Hadoop.
According to Doug Cutting, who created Hadoop (as well as Lucene and Nutch), the name comes from:
The name my kid...
on Feb 8, 2006
Doug Cutting has done it again. The creator of Lucene and Nutch has implemented (with Mike Cafarella and others) a distributed platform for high-volume data processing called MapReduce.
MapReduce is the brainchild of Google and is very well documented by Jeffrey Dean and Sanjay Ghemawat in their paper MapReduce: Simplified Data Processing on Large Clusters. In essence, it allows...
on Sep 25, 2005