Thursday, January 5, 2012

Running a Ruby map-reduce job with Hadoop

I am currently developing an app in my spare time and needed to join a file with itself to find the movies that are both Romance and Comedy.

The file looked something like this:

movie-a Comedy
movie-b Comedy
movie-a Romance

I wanted to produce a file of the form

movie-a [Comedy,Romance]

Movies that don't include both genres should be ignored.

I immediately thought of using Hadoop; even though the file was not huge, the MapReduce model seemed a good fit for the problem.

I have done some small work in Hadoop with Java, but this project was Ruby based and I wanted to keep using Ruby even for my Hadoop job, so I used Hadoop's streaming API to solve the problem.

After consulting a couple of references and the Web, I came to a very simple solution. What follows is the code:

map.rb
  ARGF.each do |line|
    begin
      parts = line.chomp.split("\t")
      # Emit "movie<TAB>genre": the first and last columns of the line.
      puts parts.first + "\t" + parts.last
    rescue
      # Send bad lines to stderr so they don't pollute the job output.
      STDERR.puts "error: #{line}"
    end
  end
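With the two-column sample above, the mapper simply echoes each line as a key/value pair, which makes it easy to sanity-check on its own (assuming the input file sits in the current directory):

  ruby map.rb < comedy_romance_movies
  # movie-a Comedy
  # movie-b Comedy
  # movie-a Romance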
reduce.rb
  # Emit a movie only when it is tagged with both genres.
  def emit(key, values)
    if values.include?("Comedy") && values.include?("Romance")
      puts key + "\t[" + values.join(",") + "]"
    end
  end

  current_key = nil
  current_key_values = []

  ARGF.each do |line|
    (key, value) = line.chomp.split(/\t/)
    current_key = key if current_key.nil?
    if current_key == key
      current_key_values << value
    else
      emit(current_key, current_key_values)
      current_key = key
      current_key_values = [value]
    end
  end

  # The loop only emits a group when the next key shows up,
  # so the last group has to be flushed after the input ends.
  emit(current_key, current_key_values) unless current_key.nil?
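The reducer counts on its input arriving sorted by key, which Hadoop guarantees between the map and reduce phases; that is why comparing each key to the previous one is enough to detect group boundaries. Locally, sort can stand in for that step (same hypothetical file location as above):

  sort comedy_romance_movies | ruby reduce.rb
  # movie-a [Comedy,Romance]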
hadoop_run.sh
  #!/bin/bash

  HADOOP_HOME=/home/cscarioni/programs/hadoop-0.22.0
  JAR=contrib/streaming/hadoop-0.22.0-streaming.jar

  STREAMCOMMAND="$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/$JAR"

  $STREAMCOMMAND \
   -mapper 'ruby map.rb' \
   -reducer 'ruby reduce.rb' \
   -file map.rb \
   -file reduce.rb \
   -input '/home/cscarioni/Downloads/comedy_romance_movies' \
   -output /home/cscarioni/Downloads/comedy_romance_movies_results
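Before launching the real job, the whole pipeline can also be rehearsed end to end without Hadoop, since the streaming model is really just stdin and stdout glued together; plain Unix pipes do the job, with sort playing the part of the shuffle/sort phase:

  cat comedy_romance_movies | ruby map.rb | sort | ruby reduce.rb
  # movie-a [Comedy,Romance]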


In my view, the main differences between the streaming API and the Java API are:

1. The streaming API does everything through stdin and stdout between the scripts (note the use of ARGF and puts in both map.rb and reduce.rb), much like piping commands together on the command line.

2. In the Java API the results from the mapper phase are grouped together by key, so in our case the reduce phase would directly receive something like (movie-a, [Romance, Comedy]). In the streaming API, on the other hand, the grouping needs to be done manually: what we get is a stream of lines ordered by key (so all the movie-a lines sit next to each other), as the sketch below illustrates.
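To make that second difference concrete, here is a rough plain-Ruby sketch (not part of the job, and assuming the same small tab-separated file) of the grouped view that the Java API would hand to the reducer directly:

  # Build the key => [values] grouping that the Java API provides
  # out of the box; the streaming reducer above has to reconstruct
  # it line by line instead.
  pairs = File.readlines('comedy_romance_movies').map do |line|
    line.chomp.split("\t")
  end
  pairs.group_by { |movie, _genre| movie }.each do |movie, rows|
    genres = rows.map { |_movie, genre| genre }
    if genres.include?("Comedy") && genres.include?("Romance")
      puts movie + "\t[" + genres.join(",") + "]"
    end
  end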

So there we have it: a small and functional MapReduce job in Ruby with Hadoop.