Hadoop Streaming with PHP

I’ve started my journey with Hadoop, and the first thing I wanted to try was Streaming, which lets me write the mapper and reducer as standalone PHP programs that read from stdin and write to stdout.

The first thing I did was set up an alias:

alias stream='/usr/local/hadoop/bin/hadoop jar /usr/local/hadoop/contrib/streaming/hadoop-0.18.3-streaming.jar'


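Next, I created a scripts directory under $HADOOP_HOME (/usr/local/hadoop) to hold the two scripts below. Both need to be executable (chmod +x), because Streaming launches them directly through their #!/usr/bin/php line.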

wc_mapper.php

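#!/usr/bin/php
<?php
  // Keep PHP notices and warnings out of stdout, which Hadoop reads as the mapper's output.
  error_reporting(0);
  $in = fopen("php://stdin", "r");
  $results = array();  // word => count for this mapper's input
  // Compare against false so a line that evaluates to false (e.g. "0") does not end the loop.
  while ( ($line = fgets($in, 4096)) !== false )
  {
    // Split on runs of non-word characters and drop empty pieces.
    $words = preg_split('/\W+/', $line, 0, PREG_SPLIT_NO_EMPTY);
    foreach ($words as $word)
    {
      if (!isset($results[$word]))
        $results[$word] = 0;
      $results[$word] += 1;
    }
  }
  fclose($in);
  // Emit one tab-separated "word<TAB>count" line per distinct word.
  foreach ($results as $key => $value)
    print "$key\t$value\n";
?>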
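
The mapper above counts words in an array and prints the totals at the end, so its memory use grows with the number of distinct words it sees. A minimal sketch of an alternative, assuming nothing beyond the stdin/stdout contract: emit "word<TAB>1" as soon as each word is read and leave all the summing to the reducer. The function name wc_map_stream is hypothetical (it is not part of the original script), and the demo feeds it an in-memory stream instead of stdin so it can be tried without Hadoop.

```php
<?php
// Hypothetical helper, not part of the original script: reads lines from $in
// and writes one "word<TAB>1" line to $out for every word, keeping no state.
function wc_map_stream($in, $out)
{
    while (($line = fgets($in)) !== false) {
        // Split on runs of non-word characters and drop empty pieces.
        $words = preg_split('/\W+/', $line, -1, PREG_SPLIT_NO_EMPTY);
        foreach ($words as $word) {
            fwrite($out, "$word\t1\n");
        }
    }
}

// Demo on an in-memory stream; a real Streaming script would instead call
// wc_map_stream(STDIN, STDOUT).
$in = fopen('php://memory', 'w+');
fwrite($in, "the quick fox\nthe end\n");
rewind($in);
wc_map_stream($in, STDOUT);
fclose($in);
```

The trade-off is that more lines travel from the mappers to the reducer, in exchange for constant memory in the mapper. wc_reducer.php already sums repeated keys, so it should work with this output unchanged.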

wc_reducer.php

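#!/usr/bin/php
<?php
  // Keep PHP notices and warnings out of stdout, which Hadoop reads as the reducer's output.
  error_reporting(0);
  $in = fopen("php://stdin", "r");
  $results = array();  // word => total count
  while ( ($line = fgets($in, 4096)) !== false )
  {
    // Each input line is "word<TAB>count" as printed by the mapper.
    $parts = explode("\t", trim($line), 2);
    if (count($parts) < 2)
      continue;  // skip blank or malformed lines
    list($key, $value) = $parts;
    if (!isset($results[$key]))
      $results[$key] = 0;
    $results[$key] += (int) $value;
  }
  fclose($in);
  ksort($results);
  foreach ($results as $key => $value)
    print "$key\t$value\n";
?>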
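
The reducer above collects every key in an array and sorts it before printing. Hadoop sorts the map output by key before handing it to a reducer, so lines with the same key arrive next to each other, and a reducer can rely on that to sum one run of identical keys at a time. A minimal sketch under that assumption; the function name wc_reduce_sorted is hypothetical, and the demo uses an in-memory stream in place of stdin.

```php
<?php
// Hypothetical helper, not part of the original script: sums counts per key,
// assuming all lines sharing a key are adjacent in the input.
function wc_reduce_sorted($in, $out)
{
    $current = null;  // key of the run currently being summed
    $total = 0;
    while (($line = fgets($in)) !== false) {
        $parts = explode("\t", rtrim($line, "\r\n"), 2);
        if (count($parts) < 2) {
            continue;  // skip blank or malformed lines
        }
        list($key, $value) = $parts;
        if ($key !== $current) {
            // A new key starts: flush the finished run, if any.
            if ($current !== null) {
                fwrite($out, "$current\t$total\n");
            }
            $current = $key;
            $total = 0;
        }
        $total += (int) $value;
    }
    if ($current !== null) {
        fwrite($out, "$current\t$total\n");  // flush the final run
    }
}

// Demo on already-sorted input; a real Streaming script would instead call
// wc_reduce_sorted(STDIN, STDOUT).
$in = fopen('php://memory', 'w+');
fwrite($in, "fox\t1\nthe\t2\nthe\t1\n");
rewind($in);
wc_reduce_sorted($in, STDOUT);  // prints "fox<TAB>1" then "the<TAB>3"
fclose($in);
```

This version holds only one key and one running total in memory. If it is tried outside Hadoop, the mapper output has to be piped through sort first, since unsorted input would produce a separate line for each run of a key.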

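To run the job, with conf as the input directory and output4 as the output directory (which must not already exist):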

stream -input conf -output output4 -mapper /usr/local/hadoop/scripts/wc_mapper.php -reducer /usr/local/hadoop/scripts/wc_reducer.php

I’ll come back later and document this properly. I just wanted to get the initial version recorded.
