Holy Smokes, Hadoop works with S3 directly!

bin/hadoop fs -put /path/to/source s3://<S3ID>:<S3SECRET>@<BUCKET>/path/to/destination
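
I’m assuming the reverse direction works exactly the same way, so pulling the data back down should just be this (untested on my end; /local/path is a placeholder):

bin/hadoop fs -get s3://<S3ID>:<S3SECRET>@<BUCKET>/path/to/destination /local/path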

This is so cool. I’m guessing that I could also use S3 as my input or output directory for Map/Reduce jobs.

For example:

/usr/local/hadoop/bin/hadoop jar \
  /usr/local/hadoop/contrib/streaming/hadoop-0.18.3-streaming.jar \
  -input s3://<S3ID>:<S3SECRET>@<BUCKET>/conf \
  -output s3://<S3ID>:<S3SECRET>@<BUCKET>/conf-wc_output \
  -mapper /usr/local/hadoop/scripts/wc_mapper.php \
  -reducer /usr/local/hadoop/scripts/wc_reducer.php
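
I’m not pasting my real wc_mapper.php and wc_reducer.php here, but streaming scripts are just stdin/stdout filters that emit tab-separated key/value pairs, so a rough sketch of what they might look like (simplified, not the exact scripts) is:

#!/usr/bin/php
<?php
// wc_mapper.php (sketch): emit "word<TAB>1" for every whitespace-separated word on stdin
while (($line = fgets(STDIN)) !== false) {
    foreach (preg_split('/\s+/', trim($line)) as $word) {
        if ($word !== '') {
            echo $word, "\t", 1, "\n";
        }
    }
}

#!/usr/bin/php
<?php
// wc_reducer.php (sketch): streaming hands the reducer its input sorted by key,
// so just sum up consecutive lines that share the same word
$current = null;
$count   = 0;
while (($line = fgets(STDIN)) !== false) {
    list($word, $n) = explode("\t", rtrim($line, "\n"));
    if ($word === $current) {
        $count += (int)$n;
        continue;
    }
    if ($current !== null) {
        echo $current, "\t", $count, "\n";
    }
    $current = $word;
    $count   = (int)$n;
}
if ($current !== null) {
    echo $current, "\t", $count, "\n";
}

The scripts do need to be executable and sitting at that path on every node (or shipped with the job via streaming’s -file option).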

And, yes, it works; I can cat the results:

bin/hadoop fs -cat s3://<S3ID>:<S3SECRET>@<BUCKET>/conf-wc_output/part*
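
One more note: pasting the access key and secret into every URI gets old fast (and the secret ends up in shell history and job logs). If I’m reading the Hadoop S3 docs right, the credentials can instead live in hadoop-site.xml, something like:

<!-- sketch based on my reading of the Hadoop S3 wiki; property names for the s3:// block store -->
<property>
  <name>fs.s3.awsAccessKeyId</name>
  <value><S3ID></value>
</property>
<property>
  <name>fs.s3.awsSecretAccessKey</name>
  <value><S3SECRET></value>
</property>

…and then the URIs shrink down to plain s3://<BUCKET>/path.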

I did find a bug, though:

bin/hadoop fs -ls 's3://<S3ID>:<S3SECRET>@<BUCKET>/'
Found 3 items
drwxrwxrwx   - ls: -0s
Usage: java FsShell [-ls <path>]

… the “ls” command doesn’t seem to work right against S3 directories. It found 3 items, which is right, but it doesn’t list them correctly.

Anyway, my eyes are getting big. The first thing that pops into my mind is using EC2 to spin up Hadoop clusters to work on vast amounts of data stored on S3. Well, Amazon already had this idea: http://aws.amazon.com/elasticmapreduce/
