Wednesday, February 18, 2015

Clustering customers for machine learning with Hadoop and Mahout

This post was originally published in my company's tech blog Simplybusiness

The problems

  • We manage quite a bit of customer data, starting from the beginning of a customer's search for a new insurance policy, all the way until they buy (or don't buy) the policy. We keep all of this data, but for the most part, we don't do anything to improve our customer offering.
  • Our site looks exactly the same for every customer -- we don't try to engage with them on a more personal level. No customisation exists, which means that each customer's experience doesn't adapt to his or her personality, specific trade, or home or business location. Nothing at all.
  • Filling out a long form is always boring. But filling it out while being unsure of what information to put where, and being forced to make a phone call to confirm details, is even more boring. In many cases, it could mean the customer just gets bored and leaves our form.

The idea

We wanted to create and test a solution that allowed us to group together similar customers using different sets of dimensions depending on the information we wanted to provide or obtain. We thought about introducing clustering technology and algorithms to group our customers.

This would be a very rough implementation that would allow us to prove certain techniques and solutions for this type of problem -- it certainly would NOT cover all the nuances that machine learning algorithms and analysis carry with them. Many liberties were taken to get to a proof of concept. The code presented here is not 100% the same code used in the spike, but it is a very close approximation.

This post covers the implementation of the solution.

Solution

Setting up the Clustering backend algorithms to allow multidimensional clustering.
  • I had already decided that I would put into practice my knowledge of Mahout and Hadoop to run the clustering processing. I installed Hadoop using my own recipes from hadoop-vagrant, first on a local Vagrant cluster and then on an AWS cluster.
  • Hadoop is a framework for processing certain types of tasks in a distributed environment using commodity machines, which allows it to scale horizontally to a massive degree. Its main components are the map-reduce execution framework and the HDFS distributed filesystem. For more details, check out my blog post.
Getting the data:
  • After Hadoop was installed, the first task was to find and extract the data. The data was stored in a SqlServer database, so we needed to fetch it and put it into HDFS. There is a fantastic tool called Sqoop that's built just for this. Sqoop not only allows you to get data ready for HDFS, it actually uses Hadoop itself to parallelize the extraction of the data. The standard way to run Sqoop is as follows: sqoop import --driver com.microsoft.sqlserver.jdbc.SQLServerDriver --connect "jdbc:sqlserver://xxx:1433;username=xxx;password=xxx;databaseName=xxx" --query "select xxx from xxxx" --split-by customer._id --fetch-size 20 --delete-target-dir --target-dir aggregated_customers --package-name "clustercustomers.sqoop" --null-string '' --fields-terminated-by ','

The previous command generates the required Hadoop-compatible files that will be used in the subsequent analysis. The most important part is the query that you want to use to extract the data. In our case, in the first iteration, we extracted information like trade, vertical, claims, years_insured, and turnover. These values are the dimensions that we will use to group our "similar" customers.

K-Means Clustering.

I have read quite a bit about different machine learning techniques and algorithms. I have developed a bit with them in the past, particularly in the recommendation area. The first thing to decide with a Machine Learning problem is what exactly I want to achieve. First, let's look at the three main problems that Machine Learning solves, and then follow the reasoning behind my choices.

Machine Learning algorithms in Mahout can be broadly categorized in three main areas:

  • Recommendation Algorithms: Try to make an informed guess about what things you might like out of a large domain of things. In the simplest and most common form, the inference is done based on similarity. This similarity could be based on items that you've already said you like, or similarity with other users that happen to like the same items as you.
    • Assume we have a database of movies, and say you like Lethal Weapon.
      • Item-Based similarity:
        • recommendations for movies similar to Lethal Weapon.
      • User-Based similarity:
        • recommendations for movies that other people who liked Lethal Weapon liked as well
  • Classification Algorithms: These belong to the family of Supervised Learning algorithms (supervised because the set of possible categories is known beforehand). Classification algorithms allow you to assign an item to a particular category given a set of known characteristics (where the category belongs to a limited set of options).
    • This technique is used in spam detection systems.
      • Let's say you decide that any email with at least two of the following characteristics: 4 or more images, 4 words written in all-capital letters, and the text 'congratulations' with an exclamation mark at the end should be marked as Spam, and anything with fewer than two of these is not Spam.
      • The Classification system will be built upon these characteristics and rules. It knows that any incoming email that matches these rules will belong to the corresponding category.
  • Clustering Algorithms: These belong to the Unsupervised Learning family because there is no predetermined set of possible answers. Clustering algorithms are simply given a set of inputs with dimensions. The algorithm itself works out how to organize and group the data into individual clusters.

Given the previous definitions, it was very clear to me that what we needed was a Clustering solution because I didn't have any idea how the data was supposed to be organized. I wanted the system to figure out the clustering and return a set of groups containing similar customers in each of them.

I selected K-Means clustering as the clustering algorithm I wanted to use. I have some familiarity with it, and it's the most common clustering algorithm in use and in the literature.

To quickly explain K-Means: given a set of N elements and a number of resulting clusters K, it finds K centroid points (initially at random) and iterates up to X times, finding the elements of N that are closest to each centroid k and grouping them together. In each iteration the centroids are recalculated and the elements are reassigned to the cluster of their closest centroid. A better explanation is here.

From the previous explanation, you can see that K-Means expects the number of clusters K as an input. However, I had no idea at all of what a good number of clusters would be. In Mahout, you can combine K-Means with another clustering algorithm named Canopy. Canopy is capable of finding a first set of K centroids that can then be fed into K-Means.

The way Canopy works is roughly the following: instead of being given a K for the total number of clusters, you provide Canopy with a measure of the size you expect each cluster to have. This allows you to get different-sized clusters on different runs of the algorithm (e.g. if you want to cluster people by a wider or narrower geographical location). The algorithm works by using a distance measure (most machine learning algorithms use one to find similarities) and a pair of threshold values, T1 and T2, with T1 > T2. Looping through all the points in the dataset, the algorithm compares each point against each of the already created Canopies using these thresholds. If the point is within T2 of a Canopy, it is assigned to that existing Canopy. If it is within T1, but not T2, it is added to that Canopy but is still allowed to become part of another Canopy. If it is within neither, it is used as the center of a new Canopy. At the end, after all Canopies are created, the centroid is calculated for each of them, and these are the centroids fed into the next step, K-Means. A simplified sketch of this pass follows.
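To make those thresholds concrete, here is a minimal, in-memory sketch of the canopy rule just described. It is a simplification for illustration only, not Mahout's own CanopyClusterer: points are plain double[] vectors and t1 > t2.

import java.util.ArrayList;
import java.util.List;

// Simplified sketch of the canopy-creation pass described above.
// Points are plain double[] vectors; t1 > t2 are the two distance thresholds.
public class CanopySketch {

    static double distance(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            sum += (a[i] - b[i]) * (a[i] - b[i]);
        }
        return Math.sqrt(sum);
    }

    static List<List<double[]>> canopies(List<double[]> points, double t1, double t2) {
        List<double[]> centers = new ArrayList<>();
        List<List<double[]>> members = new ArrayList<>();
        for (double[] p : points) {
            boolean withinT2 = false;
            boolean withinT1 = false;
            for (int i = 0; i < centers.size(); i++) {
                double d = distance(p, centers.get(i));
                if (d < t2) {                      // tightly bound: assigned to this canopy
                    members.get(i).add(p);
                    withinT2 = true;
                } else if (d < t1) {               // loosely bound: joins, but stays available for others
                    members.get(i).add(p);
                    withinT1 = true;
                }
            }
            if (!withinT2 && !withinT1) {          // close to no existing canopy: becomes a new centre
                centers.add(p);
                List<double[]> canopy = new ArrayList<>();
                canopy.add(p);
                members.add(canopy);
            }
        }
        return members;  // each canopy is then averaged to produce one of the K seed centroids for K-Means
    }
}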

Deciding on and weighting the dimensions

For K-Means, Canopy, and most machine learning algorithms, finding whether or not a particular item belongs to a group, or how to make a recommendation, is based on a measure of distance. To be able to measure the distance between two items (customers, in our use case), their values need to be converted into a form that allows them to be compared and measured. This means that we need to convert our values, whatever they are, to a numeric representation that allows us to use traditional distance measure algorithms to compare them.

The dimensions I planned on using for this first iteration were:

  • product
  • trade
  • turnover
  • employees
  • claims

Out of these dimensions, only 2 were already numbers (turnover and claims); the other 3 were text values. Even trickier, only employees had naturally ordered values ("less than 5 employees" is comparable to "more than 100 employees") while the other 2 were discrete, disjoint values (product "business" is not directly comparable to product "shop").

For the case of employees, I converted the values to consecutive numbers like (values are made up for this example):

  • "less than 15 employees" -> 1
  • "between 15 and 50 employees" -> 2
  • "between 50 and 200 employees" -> 3

For the two discrete properties, product and trade, I had to create an individual dimension for each of the discrete values they can take. As in my example I was only going to use Baker and Accountant for trade, and Shop and Business for product, the final dimensions vector ended up something like:

| shop | business | accountant | baker | turnover | employees | claims |

So let's say we wanted to model an accountant selling the business product, with a turnover of 50000, 20 employees, and 2 claims. Their vector would look like:

| 0 | 1 | 1 | 0 | 50000 | 2 | 2 |
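For illustration, here is a rough sketch of how a customer row could be turned into that vector: one-hot columns for the discrete product and trade values, a banded number for employees, and the raw numbers for turnover and claims. The helper names here are made up for the example and are not the spike code.

import java.util.Arrays;
import java.util.List;

// Illustrative vectorizer for the 7-dimension layout above:
// | shop | business | accountant | baker | turnover | employees | claims |
public class CustomerVectorizer {

    static final List<String> PRODUCTS = Arrays.asList("shop", "business");
    static final List<String> TRADES = Arrays.asList("accountant", "baker");

    // Band the employee count into the made-up consecutive values from the text.
    static int employeeBand(int employees) {
        if (employees < 15) return 1;   // "less than 15 employees"
        if (employees < 50) return 2;   // "between 15 and 50 employees"
        return 3;                       // "between 50 and 200 employees"
    }

    static double[] vectorize(String product, String trade, double turnover, int employees, int claims) {
        int base = PRODUCTS.size() + TRADES.size();
        double[] v = new double[base + 3];
        v[PRODUCTS.indexOf(product)] = 1;                 // one-hot product dimension
        v[PRODUCTS.size() + TRADES.indexOf(trade)] = 1;   // one-hot trade dimension
        v[base] = turnover;
        v[base + 1] = employeeBand(employees);
        v[base + 2] = claims;
        return v;
    }

    public static void main(String[] args) {
        // The accountant from the example: | 0 | 1 | 1 | 0 | 50000 | 2 | 2 |
        System.out.println(Arrays.toString(vectorize("business", "accountant", 50000, 20, 2)));
    }
}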

We can already see a problem with this vector: the turnover value is much larger than the rest of the dimensions. This means that the distance calculation will be extremely influenced by this value; we say this value has a much bigger weight than the rest. For the example, we assume that there is a maximum turnover of 100000.

In our case, we want to give extra weight to the product and trade dimensions and make turnover much less significant.

Mahout offers functionality for doing just that, normally through an implementation of the class WeightedDistanceMeasure. It works by building a Vector with a multiplier for each dimension of the original vector; this weights vector needs to be the same size as the dimensions vector. In our case, we could have a vector like this:

| 10 | 10 | 5 | 5 | 1/100000 | 1/2 | 1/10 |

The effect of that vector is to rescale the original dimensions: the product dimensions are multiplied by 10 and the trade dimensions by 5, turnover is kept below 1, the influence of the number of employees is halved, and claims are made less influential.
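To see why this matters, here is a rough sketch of a weighted Euclidean distance using the multipliers above. It illustrates the principle rather than Mahout's code: each dimension's contribution to the distance is scaled by its weight, so the heavily weighted product and trade columns dominate while turnover barely matters. The second vector is a made-up baker used only for comparison.

// Illustration of a weighted Euclidean distance using the weights above
// (not Mahout's implementation, just the underlying idea).
public class WeightedDistanceSketch {

    static final double[] WEIGHTS = {10, 10, 5, 5, 1.0 / 100000, 1.0 / 2, 1.0 / 10};

    static double distance(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += WEIGHTS[i] * diff * diff;   // large weights amplify a dimension, small ones dampen it
        }
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        double[] accountant = {0, 1, 1, 0, 50000, 2, 2};   // the example customer above
        double[] baker      = {1, 0, 0, 1, 50500, 2, 1};   // a made-up baker with a similar turnover
        // Unweighted, the 500 difference in turnover (250000 once squared) would dwarf everything else;
        // weighted, the differing product and trade columns drive the distance instead.
        System.out.println(distance(accountant, baker));
    }
}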

NOTE: Finding the correct dimensions and weights for a clustering algorithm is a really hard exercise which normally requires multiple iterations to find the "best" solution. Our example, in keeping with the spike approach of this hackathon, uses completely arbitrary values chosen just to prove the technique, not carefully crafted normalizations of the data. If these values are good enough for our examples, then they are good enough.

Following are the main parts of the raw code written to convert the initial data to a list of vectors:

public class VectorCreationMapReduce extends Configured implements Tool {

public static class VectorizerMapper extends Mapper<LongWritable, Text, Text, VectorWritable> {

    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {

        VectorWritable writer = new VectorWritable();
        System.out.println(value.toString());
        String[] values = value.toString().split("\\|");
        // vectorForVertical, vectorForTrade, vectorForDouble and concatArrays are small helpers from the spike (not shown here)
        double[] verticals = vectorForVertical(values[1]);
        double[] trade = vectorForTrade(values[2]);
        double[] turnover = vectorForDouble(values[3]);
        double[] claimCount = vectorForDouble(values[4]);
        double[] xCoordinate = vectorForDouble(values[7]);
        double[] yCoordinate = vectorForDouble(values[8]);
        NamedVector vector = new NamedVector(new DenseVector(concatArrays(verticals, trade, turnover, claimCount, xCoordinate, yCoordinate)), values[0]);
        writer.set(vector);
        context.write(new Text(values[0]), writer);
    }

 }

public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    int res = ToolRunner.run(conf, new VectorCreationMapReduce(), args);
    System.exit(res);
}

  @Override
  public int run(String[] strings) throws Exception {
    Configuration conf = super.getConf();
    conf.set("fs.default.name", "hdfs://"+ Configurations.HADOOP_MASTER_IP+":9000/");
    conf.set("mapred.job.tracker", Configurations.HADOOP_MASTER_IP+":9001");
    Job job = new Job(conf, "customer_to_vector_mapreduce");
    job.setJarByClass(VectorCreationMapReduce.class);
    job.setMapperClass(VectorizerMapper.class);
    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(VectorWritable.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(VectorWritable.class);
    job.setOutputFormatClass(SequenceFileOutputFormat.class);
    job.setNumReduceTasks(0);
    FileInputFormat.addInputPaths(job, "aggregated_customers_with_coordinates");
    FileOutputFormat.setOutputPath(job, new Path("vector_seq_file"));
    return job.waitForCompletion(true) ? 0 : 1;
  }
}

The code is a Hadoop map-reduce job (with only a map phase) that takes the input from the Sqoop-exported file and creates a NamedVector with the dimension values. The Vector classes used are Mahout-provided classes for use within Hadoop.

The next step was to run the actual Canopy algorithm to find the K centroids that we are going to feed to the K-Means algorithm.

This is run something like:

./hadoop org.apache.mahout.clustering.canopy.CanopyDriver  -i /vector_seq_file/part-m-00000 -o customer-centroids -dm clustercustomers.mahout.CustomWeightedEuclideanDistanceMeasure -t1 4.0 -t2 2.0

The previous command specifies the Mahout class that contains the Canopy Hadoop job. It specifies an input file from HDFS, which is the output generated by the previous vectorization process, and an output directory, customer-centroids, where the generated centroids will be written. We also specify that we want to use a Euclidean distance measure with weighting, which is defined with the custom class we see below:

public class CustomWeightedEuclideanDistanceMeasure extends WeightedEuclideanDistanceMeasure {

  // Note the double literals: writing 1/1000, 1/2, 1/10 would be integer division and produce 0 weights.
  public static final Vector WEIGHTS = new DenseVector(new double[]{10, 10, 5, 5, 1.0/1000, 1.0/2, 1.0/10});
  public CustomWeightedEuclideanDistanceMeasure(){
      super();
      setWeights(WEIGHTS);
  }
}

This class simply extends the Mahout provided WeightedEuclideanDistanceMeasure and sets the custom weight vector that we mentioned above. This will make sure that when the algorithm runs, the weighting will be applied to all the vectors from the input.

Now that we have generated our K centroids, it is time to run the actual K-Means clustering algorithm. This is also very simple to run by using the Mahout provided classes:

 hadoop org.apache.mahout.clustering.kmeans.KMeansDriver  -i vector_seq_file/part-m-00000 -c customer-centroids/clusters-0-final -o customer-kmeans -dm clustercustomers.mahout.CustomWeightedEuclideanDistanceMeasure -x 10 -ow --clustering

In this command, we are specifying that we want to run the KMeansDriver hadoop job. We pass in the input vector file again, and we specify the centroids file generated by canopy with the -c option. We then specify where the clustering output should go, the use of the weighting mechanism again, and how many iterations we want to do on the data.

Here's a quick overview of how K-Means actually works:

The K-Means clustering algorithm starts with the given set of K centroids and iterates, adjusting the centroids, until the iteration limit X is reached or until the centroids converge to a point from which they no longer move. Each iteration has 2 steps and works the following way:
  • For each point in the input, it finds the nearest centroid and assigns the point to the cluster represented by that centroid.
  • At the end of the iteration, the points in each cluster are averaged to recalculate the new centroid position.
  • If the maximum number of iterations is reached, or the centroid points don't move any more, the clustering concludes.
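As a minimal illustration of those two steps, here is a small in-memory sketch of the assignment and centroid-update loop. It is just the idea -- the Mahout job distributes this same logic over map-reduce and is not shown here.

import java.util.List;

// In-memory sketch of the K-Means loop: assign points to the nearest centroid,
// then recompute each centroid as the average of its assigned points.
public class KMeansSketch {

    static double distance(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            sum += (a[i] - b[i]) * (a[i] - b[i]);
        }
        return Math.sqrt(sum);
    }

    static double[][] cluster(List<double[]> points, double[][] centroids, int maxIterations) {
        int k = centroids.length;
        int dims = centroids[0].length;
        for (int iteration = 0; iteration < maxIterations; iteration++) {
            double[][] sums = new double[k][dims];
            int[] counts = new int[k];
            for (double[] p : points) {                       // step 1: assign each point to its nearest centroid
                int nearest = 0;
                for (int c = 1; c < k; c++) {
                    if (distance(p, centroids[c]) < distance(p, centroids[nearest])) nearest = c;
                }
                counts[nearest]++;
                for (int d = 0; d < dims; d++) sums[nearest][d] += p[d];
            }
            boolean moved = false;                            // step 2: recompute centroids as cluster averages
            for (int c = 0; c < k; c++) {
                if (counts[c] == 0) continue;                 // empty cluster: keep the old centroid
                for (int d = 0; d < dims; d++) {
                    double updated = sums[c][d] / counts[c];
                    if (Math.abs(updated - centroids[c][d]) > 1e-9) moved = true;
                    centroids[c][d] = updated;
                }
            }
            if (!moved) break;                                // converged: centroids no longer move
        }
        return centroids;
    }
}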

K-Means (and Canopy as well) are parallelizable algorithms, meaning that you can have many jobs working on subsets of the problem and aggregating results. This is where Hadoop comes in for the clustering execution. Internally, Mahout, and in particular KMeansDriver, is built to work on the Hadoop map-reduce infrastructure. By leveraging Hadoop's proven map-reduce implementation, Mahout algorithms are able to scale to very big data sets and process them in a parallel way.

After generating the clusters, the next step is to create individual cluster files, plus a single combined cluster file, with the simple syntax (cluster_id, customer_id).

This is done with the following map and reduce methods:

 public static class ClusterPassThroughMapper extends Mapper<IntWritable, WeightedVectorWritable, IntWritable, Text> {
    public void map(IntWritable key, WeightedVectorWritable value, Context context) throws IOException, InterruptedException {
      NamedVector vector = (NamedVector) value.getVector();
      context.write(key,new Text(vector.getName()));
    }
}

public static class ClusterPointsToIndividualFile extends Reducer<IntWritable, Text, IntWritable, Text> {
    private MultipleOutputs mos;

    public void setup(Context context) {
        mos = new MultipleOutputs(context);
    }


    public void reduce(IntWritable key, Iterable<Text> value, Context context) throws IOException, InterruptedException {
        for(Text text: value){
            mos.write("seq", key, text, "cluster-"+key.toString());
            context.write(key,text);
        }
    }

    public void cleanup(Context context) throws IOException, InterruptedException {
        mos.close();
    }
}

This has allowed us to obtain clusters of customers - next week's post will explore what can be done with these clusters!

PART 2

Read the first part of this hackathon implementation at Clustering our customers to get a full background on what's being presented here!

Analysing the data

At the end of the first post, we had obtained various clusters of customers in HDFS. This is the most important part of the job. However, we still need to do something with those clusters. We thought about different possibilities: "What is the most frequently asked question among members of a cluster?", "What is the average premium that people pay in a particular cluster?", "What is the average time people have been insured with us in a particular cluster?".

Most of the analyses we defined here were straightforward aggregations, a job that Hadoop map-reduce does nicely (more so as the data is already stored on HDFS).

There are many ways to analyse data with Hadoop, including Hive, Scalding, Pig, and some others. Most of these translate from a particular language to the native map-reduce jobs supported by Hadoop. I chose Apache Pig for my analysis as I really liked the abstraction it creates on top of standard Hadoop map-reduce. It does this through a language called Pig Latin. Pig Latin is very easy to read and write, both for people familiar with SQL analysis queries and for developers familiar with a procedural way of doing things.

Here I will show just one example, how to calculate the average premium for each cluster. Following is the Pig program that does this:

premiums = LOAD '/user/cscarion/imported_quotes' USING PigStorage('|') AS(rfq_id, premium, insurer);
cluster = LOAD '/user/cscarion/individual-clusters/part-r-00000' AS(clusterId, customerId);
customers = LOAD '/user/cscarion/aggregated_customers_text' using PigStorage('|') AS(id, vertical, trade, turnover, claims,rfq_id);

withPremiums = JOIN premiums BY rfq_id, customers BY rfq_id;

store withPremiums into 'withPremiums' using PigStorage('|');

groupCluster2 = JOIN withPremiums BY customers::id, cluster BY customerId;

grouped2 = GROUP groupCluster2 BY cluster::clusterId;

premiumsAverage = FOREACH grouped2 GENERATE group, AVG(groupCluster2.withPremiums::premiums::premium);

STORE premiumsAverage into 'premiumsAverage' using PigStorage('|');

The previous Pig program should be fairly straightforward to understand.

  • The RFQs with their respective premiums are loaded from HDFS into tuples like (rfq_id, premium, insurer)
  • The cluster information is loaded from HDFS into tuples like (cluster_id, customer_id)
  • The customers are loaded from the originally imported file into a tuple like (id, vertical, trade, turnover, claims,rfq_id)
  • The RFQs are joined with the customers using the rfq_id. This will basically add the premium to the customer collection.
  • The result of the previous join is joined with the cluster tuples. This essentially adds the cluster_id to the customer collection.
  • Results in the previous join are grouped by the cluster_id
  • For each group computed in the previous step, the average is calculated.
  • The results from the previous step are stored back into HDFS as a tuple (cluster_id, average_premium)

Querying the data from the main application

Now that we have the data and have run the analytics over it, the next step is to consume this data from the main web application. For this, and to make it as unobtrusive as possible, I created a new server application (using Play) with a simple interface that connected to HDFS and could be queried from our main application.

So for example, when a customer is filling out the form, we can invoke an endpoint on this new service like this: GET /?trade=accountant&product=business&claims=2&turnover=25000&employees=2

This call will vectorise that information, find the correct cluster, and return the information for the given cluster. The important parts of the code follow:

// userIds, loadCentroids and businessesForUserIds are members of the surrounding service (not shown here)
def similarBusinesses(business: Business): Seq[Business] = {
  loadCentroids()
  val cluster = clusterForBusiness(business)
  val maxBusinesses = 100
  var currentBusinesses = 0
  CustomHadoopTextFileReader.readFile(s"hdfs://localhost:9000/individual-clusters/cluster-${cluster}-r-00000") {
    line =>
      val splitted = line.split("\t")
      userIds += splitted(1)
      currentBusinesses += 1

  }(currentBusinesses < maxBusinesses)
  businessesForUserIds(userIds)
}

That code finds the cluster for the business arriving from the main application. Then it reads the HDFS file representing that individual cluster and gets the business information for the returned user ids.

To find the cluster to which the business belongs, we compare it against the stored centroids:

  private def clusterForBusiness(business: Business): String = {
    val businessVector = business.vector
    var currentDistance = Double.MaxValue
    var selectedCentroid: (String, Cluster) = null
    for (centroid <- SimilarBusinessesRetriever.centroids) {
      if (distance(centroid._2.getCenter, businessVector) < currentDistance) {
        currentDistance = distance(centroid._2.getCenter, businessVector);
        selectedCentroid = centroid;
      }
    }
    clusterId = Integer.valueOf(selectedCentroid._1)
    selectedCentroid._1
  }

The code that actually reads from the Hadoop filesystem follows; it looks much like reading a simple file:

object CustomHadoopTextFileReader {
  def readFile(filePath: String)(f: String => Unit)(g: => Boolean = true) {
    try {
      val pt = new Path(filePath)
      val br = new BufferedReader(new InputStreamReader(SimilarBusinessesRetriever.fs.open(pt)));
      var line = br.readLine()
      while (line != null && g) {
        f(line)
        line = br.readLine()
      }
    } catch {
      case e: Exception =>
        e.printStackTrace()
    }
  }
}

Then to return the premium for a particular cluster:

def averagePremium(cluster: Int): Int = {
  CustomHadoopTextFileReader.readFile("hdfs://localhost:9000/premiumsAverage/part-r-00000") {
    line =>
      val splitted = line.split("\\|")
      if (cluster.toString == splitted(0)) {
        return Math.ceil(java.lang.Double.valueOf(splitted(1))).toInt
      }
  }(true)
  0
}

Any of these values can then be returned to the calling application - this can provide a list of "similar" businesses or retrieve the premium average paid for those similar businesses.

This was it for the first iteration of the hack! The second iteration was to use the same infrastructure to cluster customers based on the location of their businesses instead of the dimensions used here. Given that the basic procedure for clustering remains the same despite the dimensions utilised, the code for that looks similar to the code presented here.

Friday, September 5, 2014

Ruby unit testing is weak as a safety net

Ruby unit testing feels very nice and natural to write with RSpec. Since I started working with Ruby more often, that is one of the things I have liked the most. However, I still love Java (or, for this particular example, anything with static typing); it is still my main language, and in many cases it is superior to Ruby.

One of those cases is the value of unit tests over the long run, as a safety net and an aid in refactoring.

Let's see this example in Ruby first:

class Collaborator
  def do_stuff

  end
end

class Unit
  def initialize(collaborator)
    @collaborator = collaborator
  end

  def call_collaborator
    @collaborator.do_stuff
  end
end

describe Unit do
  let(:collaborator) { Collaborator.new }
  subject { Unit.new(collaborator) }

  it 'should call collaborator' do
    expect(collaborator).to receive(:do_stuff)
    subject.call_collaborator
  end
end

That seems like a very simple unit test. It is making sure that my class under test makes a necessary method call to a collaborator.

I run the test and it passes as expected:

 $ bundle exec rspec
.

Finished in 0.00051 seconds
1 example, 0 failures

Now, sometime later I want to change my collaborator. The collaborator would have its own unit test that I would need to change first, of course. However, I won't look all over my code base for where this method is used (already very difficult to do in Ruby). So I make my collaborator look like this:

class Collaborator
  def do_the_important_stuff

  end
end

If I rerun my test for Unit this is what I get:

.

Finished in 0.00051 seconds
1 example, 0 failures

Yes, the test still passes even though the method that is supposed to be called on the collaborator doesn't exist anymore! This would for sure break in production when this branch of code is reached. It can (and needs to) be mitigated with more integration tests and by not trusting the unit tests so much, but that still leaves the original, now useless, unit test in place.

With Java it is a completely different story and the Unit tests can be trusted a lot more. Here is the same example:

interface Collaborator {
  void doStuff();
}

 public class Unit {
   private Collaborator collaborator;

   public Unit(Collaborator collaborator) {
    this.collaborator = collaborator;
  }

  public void callCollaborator() {
    this.collaborator.doStuff();
  }
}

public class TestEr {

  private Collaborator collaborator = Mockito.mock(Collaborator.class);
  private Unit unit = new Unit(collaborator);

  @Test
  public void shouldCallCollaborator() {
    unit.callCollaborator();
    Mockito.verify(collaborator).doStuff();
  }
}

This test passes as well. But now, if we go ahead and change the Collaborator to be:

interface Collaborator {
  void doSomeStuff();
}

We will automatically get a compilation error in all the places that use this method, including our unit test. This takes us directly to the error so we can fix the dependency properly before deploying anything. In this case the unit test has proven to be really valuable as a safety net to capture a problem introduced by a seemingly simple refactor.

Of course the example is super simple, but you can imagine that in a large-scale application the Ruby problem is not so unrealistic.

Friday, October 11, 2013

Checking Jquery Ajax requests from Capybara

I was working on a particular feature today. I wanted to test an autocomplete input box that makes its requests via Ajax. Something very common.

I wanted to test that the requests were being done to the right URL in a black box kind of way using the cucumber features.

This is what I did:

  • First, I added a new visit method visit_tracking_ajax to my Cucumber environment with the following content

    module CucumberGlobals
     def visit_tracking_ajax(url)
      visit(url)
      page.execute_script("$(document).ajaxSuccess(function(a,b,c){console.log(this);window._latest_ajax_url_called = c.url;})")
     end
    end
    
    World(CucumberGlobals)
    

This executes a JavaScript snippet in the page that uses jQuery to listen to every Ajax request being made (through jQuery) and sets a variable on window with the URL called in that request.

  • Then from my step definition I did something like:

    page.latest_ajax_url.should == 'http://www.xx.xxx'
    
  • Having also added the following method to the Capybara session:

    module CustomMethods
     def latest_ajax_url
       evaluate_script("window._latest_ajax_url_called")
     end
    end
    
    module Capybara
      class Session
        include CustomMethods
      end
    end
    

That's it -- now I was checking the last Ajax URL being called. This is fragile code, but it worked fine for my current use case.

Monday, October 7, 2013

Simple setup for testing MongoDB map reduce with Ruby

This is a super quick tutorial on running Mongo map-reduce aggregations from Ruby. This is as simple as it gets: a MongoDB running locally and a single Ruby script. For the test I created the following simple collection in a Mongo DB called example:

> db.values.find()
  { "_id" : ObjectId("52530971a1b6dd619327db27"), "ref_id" : 1, "value" : 2 }
  { "_id" : ObjectId("5253097ba1b6dd619327db28"), "ref_id" : 1, "value" : 4 }
  { "_id" : ObjectId("52530983a1b6dd619327db29"), "ref_id" : 2, "value" : 10 }
  { "_id" : ObjectId("52530989a1b6dd619327db2a"), "ref_id" : 2, "value" : 15 }
  { "_id" : ObjectId("52531414a1b6dd619327db2b"), "ref_id" : 3, "value" : 100 } 

Then I created a simple Ruby project with a single entry, gem "mongoid", in the Gemfile.

The following mongo.yml configuration file:

development:
  sessions:
    default:
      database: mongoid
      hosts:
        - localhost:27017

The Ruby script is then

require 'mongoid'
Mongoid.load!("/Users/cscarioni/ossources/mongo_map_reduce/mongo.yml")

session = Moped::Session.new([ "127.0.0.1:27017"])

session.use :example

map = %q{
  function() {
    emit(this.ref_id,{value: this.value});
  }
}

reduce = %q{
  function(key, values) {
    var value_accumulator = 0;
    values.forEach(function(val){
    value_accumulator += val.value
  });
   return {val: value_accumulator} 
  }
}


puts session.command(
  mapreduce: "values",
  map: map,
  reduce: reduce,
  out: { inline: 1 }
)

When running the script, we get the following results:

 MONGOID_ENV=development ruby mongo_map_reduce.rb


{
  "results"   =>   [
    {
       "_id"         =>1.0,
       "value"         =>         {
          "val"            =>6.0
       }
    },
    {
       "_id"         =>2.0,
       "value"         =>         {
          "val"            =>25.0
       }
    },
    {
       "_id"         =>3.0,
       "value"         =>         {
          "value"            =>100.0
       }
    }
 ],
  "timeMillis"   =>13,
  "counts"   =>   {
    "input"      =>5,
    "emit"      =>5,
    "reduce"      =>2,
    "output"      =>3
  },
  "ok"   =>1.0
}

That's it. A quick example. Sometimes it is more convenient to use a Ruby script than to deal with the Mongo JSON shell interface.

Sunday, September 15, 2013

Analyzing your Mongo data with Pig

MongoDB offers a fantastic query interface, however sometimes in Big Data scenarios we may want to combine our MongoDB database with the processing power of Hadoop in order to extract and analyze the data stored in our DB.

Working with Hadoop directly, writing map-reduce jobs, is all well and good; however, sometimes we may want or need a simpler and more straightforward interface to execute our analysis jobs. Here is where Apache Pig comes in.

What is so good about Pig is that it offers a nice abstraction on top of map-reduce that takes away some of the inherent complexity of programming in this paradigm, and instead offers a high-level language to achieve our analysis tasks. Pig then translates these high-level descriptions into the proper map-reduce jobs.

What I will show next is how to integrate Pig (in the simplest possible scenario, with everything running on localhost) to extract and analyze data from your Mongo database and store the results back into Mongo.

The example

The example will be a very simplistic insurance policy storage analysis application. Basically we will have a couple of simple MongoDB collections which we will aggregate together and run some analysis on them. The collections we have are:

> show collections
    policies
    providerType

with the following contents:

> db.policies.find()
   { "_id" : ObjectId("5235d8354d17d7080dbd913f"), "provider" : "ic1", "price" : 50 }
   { "_id" : ObjectId("5235d8434d17d7080dbd9140"), "provider" : "ic1", "price" : 200 }
   { "_id" : ObjectId("5235d84d4d17d7080dbd9141"), "provider" : "ic2", "price" : 400 }
   { "_id" : ObjectId("5235d8524d17d7080dbd9142"), "provider" : "ic2", "price" : 150 }

The policies collection has a list of documents recording sold policies, the insurance provider, and the price they were sold for.

> db.providerType.find()
   { "_id" : ObjectId("5235e85794eeca389060cd40"), "provider" : "ic1", "type" : "premium" }
   { "_id" : ObjectId("5235e86194eeca389060cd41"), "provider" : "ic2", "type" : "standard" }

The providerType collection has the type of each particular insurance provider. As an example, there are premium providers and standard providers.

Our analysis process will simply compute an average of the price spent buying the premium policies. For that, it is obvious that we will need a grouping and an aggregation.

Setting up

For the example we will need the following:

After you download (or clone) all the needed software, you proceed to do the following:

  • Build mongo_hadoop: From the root of the cloned repository, after checking out the r1.1.0 tag, execute ./sbt package. This will build the needed jar files for us.
  • From the previous step you would normally need to copy a couple of jar files to your Hadoop environment if we were using full Hadoop. However, as we will just be running Pig locally, this is not needed.

You need to have some environment variables configured, in particular JAVA_HOME and PATH. I have the following in my .bash_profile:

if [ -f ~/.bashrc ]; then
 . ~/.bashrc
fi
PATH=$PATH:$HOME/bin:/home/vagrant/Programs/pig-0.11.1/bin
export PATH
export JAVA_HOME=/usr/java/default/
export PIG_HOME=/home/vagrant/Programs/pig-0.11.1/

Next you will need the actual Pig script file. Contents below:

REGISTER /home/vagrant/Programs/pig-0.11.1/contrib/piggybank/java/piggybank.jar
REGISTER /home/vagrant/Programs/pig-0.11.1/pig-0.11.1.jar
REGISTER  /home/vagrant/Downloads/mongo-java-driver-2.11.3.jar
REGISTER /home/vagrant/ossources/mongo-hadoop/core/target/mongo-hadoop-core-1.1.0.jar
REGISTER /home/vagrant/ossources/mongo-hadoop/pig/target/mongo-hadoop-pig-1.1.0.jar

raw_providers = LOAD 'mongodb://localhost:27017/insurance_example.policies' using com.mongodb.hadoop.pig.MongoLoader;
raw_provider_types = LOAD 'mongodb://localhost:27017/insurance_example.providerType' using com.mongodb.hadoop.pig.MongoLoader;


with_names = FOREACH raw_providers GENERATE $0#'provider' as provider, (double)$0#'price' as price;
types = FOREACH raw_provider_types GENERATE $0#'provider' as provider, $0#'type' as type;
premiums = FILTER types BY type == 'premium';

by_provider = COGROUP with_names BY provider, premiums BY provider INNER;

averages = FOREACH by_provider GENERATE group, AVG(with_names.price) as avg;

STORE averages
  INTO 'mongodb://localhost:27017/insurance_example.premium_policies_averages'
  USING
  com.mongodb.hadoop.pig.MongoInsertStorage('group:chararray,avg:float', 'group');

This is the file where all the magic happens. To run the file, assuming you called it insurance.pig as I did, simply execute pig -exectype local -f insurance.pig.

After you run the file, and after a substantial output, you can go to your MongoDB and check the new collection:

> db.premium_policies_averages.find()
  { "_id" : ObjectId("5235f2490cf24ccadbf9638a"), "group" : "ic1", "avg" : 125 }

You can see that it has now stored the average price for all the premium policy providers, in our example the only premium provider is ic1.

Let's have a quick walkthrough of the insurance.pig file.

  • The first five lines (the ones starting with REGISTER) simply add the necessary dependencies to the CLASSPATH of the Pig job we are going to run. In this case it is worth noting that we are adding the jars from the Mongo Java Driver that we downloaded and the ones we built from the mongo_hadoop project that we cloned from Github.
  • The next 2 lines (starting with LOAD) load the data from the necessary Mongo collections into their respective relations.
  • In the next 2 lines (starting with FOREACH) we give names to the raw data we extracted before, for future access.
  • The next line (starting with FILTER) filters the types down to the ones whose type equals 'premium'.
  • The next line (starting with COGROUP) creates a combined group between the provider/price tuples and the premium providers, grouped by provider name.
  • The next line generates the tuples with the corresponding averages.
  • The last part of the script stores the generated tuples into the MongoDB collection, specifying the type of the data stored.

And that's it. This is a very simplistic example, running with very little data and in a local, non-distributed environment. However, it serves to illustrate how to use the power of Hadoop and Pig to extract and analyze data from MongoDB.

  • I created a Vagrant box with the example, which can be downloaded here
  • P.S. Although we did not install Hadoop per se (it is not needed when running Pig in local mode), Hadoop map-reduce functionality is still used internally by Pig to execute the jobs that we create.

    Monday, July 29, 2013

    Javascript for Java developers 1.Lack of Classes. Instantiating objects

    Lack of classes. Instantiating objects:

    I have been working mostly in Java in my 9 years of development experience. In those years I have worked with many other Java developers, and one of the things many of them have in common is that they don't like (and sometimes even hate) Javascript.

    I admit that Javascript has never been my favorite language either, but I do recognize that it is a powerful and ever more useful language to know and understand.

    This quick tutorial will try to explain some of the points I think make Java developers feel that they hate Javascript.

    1. Lack of Classes. Instantiating objects
    2. Inheritance and Encapsulation
    3. Package and organizing code
    4. Functional programming and callback functions

    In this first Post I will talk about point number 1.

    Lack of Classes. Instantiating objects

    In Java everything we program exists within a class; even the simplest of programs needs a class. Classes are used more than anything else as a blueprint for the instantiation of objects. The following code shows a simple class in Java:

    class Simple {
      private String propertyA;
      private String propertyB;
    
      public Simple(String a, String b){
        this.propertyA = a;
        this.propertyB = b;
      }
    
      public String concatenateAandB(){
        return propertyA + propertyB;
      }
    }
    

    In the previous chunk of code we have created a very simple Java class. This class has one explicit constructor, a couple of properties, and a public method. In order to use it we would do something like:

    Simple instance = new Simple("hello ", "world");
    System.out.println(instance.concatenateAandB());
    

    That would print hello world to the console. So how would we do something as simple as that with Javascript? First of all, let's remember that there is no such thing as classes in Javascript. However, the previous code is easily reproducible in Javascript. I'll show the code and then explain how it works:

    function Simple(a, b) {
      var propertyA = a;
      var propertyB = b;
    
      this.concatenateAandB = function() {
        return propertyA + propertyB;
      }
    }
    

    The way to execute this is very similar to the Java version. We create an instance and call the method:

    var instance = new Simple("hello", "world");
    console.log(instance.concatenateAandB());
    

    As you can see, the functionality here is very similar. The main difference is that instead of having a class and a constructor for that class, we have an isolated function that by convention starts with a capital letter when we want it to work as a constructor (there is nothing apart from the convention that stops us from writing function simple(a, b) {). Simple is a standard Javascript function. However, as you can see, we called it in a non-conventional way using the new keyword. You could call this function like Simple("hello", "world"), as the following chunk of code shows:

    Simple("hello ", "world");
    console.log(window.concatenateAandB());  
    

    In this second scenario we can see that we are not creating a new instance with new, instead we are calling the function directly as any other Javascript function. This has the effect of calling the function using the current object as the this object (in this case the window object). When using the new keyword, instead of using the current object as this, a new empty object is created and that one is used as the context in the function. Also when using new the created instance object is returned by the method call.

    This is one of the most important things to know when working with objects in Javascript: functions behave differently when called with new than when called without it. In the first case a new empty object ({}) is created and used as the this context inside the function. When the function is called without the new keyword, the current this object (the window if you are calling from the top level) is used as the context inside the function, and no new instance is created.

    This makes Javascript a bit confusing when creating objects, as there is no real difference between a standard function and a constructor function. It is also strange that constructor functions can live all on their own, while in Java constructors logically live inside class definitions.

    In the next Post I will explore how Javascript deals with inheritance and encapsulation, and compare it to Java as I just did here.

    Wednesday, April 17, 2013

    Pro Spring Security and OAUTH 2

    OAuth and Spring Security

    My upcoming Pro Spring Security is heavily focused on the inner workings of the Spring Security core framework and how everything fits together under the hood.

    And although I do cover very important providers for authentication and authorization (including LDAP, Database, CAS, OpenID, etc.), I don't cover another important provider, which is OAuth. The reason it is not covered in the book is that it is not part of the core of Spring Security; it is instead part of the Spring Security extensions project. However, I know OAuth is a very popular authorization protocol, so I will cover it here as a complement to the information in the book.

    In this post I want to introduce you to using OAuth with Spring Security. Many of the concepts will not be straightforward to understand, and I recommend reading the book Pro Spring Security to understand the architecture and design of Spring Security and how it works internally.

    First, let's take an overall look at the OAuth 2 protocol. I won't be talking about the original OAuth 1 at all in this post.

    OAuth 2 is an authorization protocol that specifies the ways in which authorization can be granted to certain clients to access a determined set of resources.

    Directly taken from the specification, we can see in the following paragraph that OAuth 2 defines a specific set of Roles interacting in the security process:

    resource owner
         An entity capable of granting access to a protected resource.
         When the resource owner is a person, it is referred to as an end-
         user.
     
    resource server
         The server hosting the protected resources, capable of accepting
         and responding to protected resource requests using access tokens.
     
    client
         An application making protected resource requests on behalf of the
         resource owner and with its authorization.  The term client does
         not imply any particular implementation characteristics (e.g.
         whether the application executes on a server, a desktop, or other
         devices).
     
    authorization server
         The server issuing access tokens to the client after successfully
         authenticating the resource owner and obtaining authorization.

    The overall interaction that happens in the protocol is also specified in the specification, and it goes something like this:

    You can see the interaction between the 4 roles defined by the protocol.

    The main idea to keep in mind in the traditional scenario is that an application (the Client) will try to access another application's secured resources (Resource Server) on behalf of a user (Resource Owner). In order to do so, the user needs to prove their identity, and to do so they contact the Authorization Server, presenting their credentials. The Authorization Server authenticates the user and grants the client/user combination a token that can be used to access the particular resource. In many cases the Authorization Server and Resource Server live in the same place. For example, if you try to access your Facebook account from another application, you will get redirected to the Facebook login (Authorization Server) and then you will grant certain permissions to be able to access your Facebook wall (a resource in the Resource Server).

    Ok, let’s see how all that fits in the Spring Security OAuth project. There will be a couple of things that don’t work exactly the way that sites like Facebook or Twitter work, but we get to that later.

    I will create both an OAuth service provider (Authorization Server, but also Resource Server) and an OAuth client (Client Application). The first thing I'll do is check out the source of the project from Github, which I normally do with open source projects.

    git clone git://github.com/SpringSource/spring-security-oauth.git

    Then checkout the latest stable tag with git checkout 1.0.1.RELEASE

    Next we will be playing with the OAuth 2 samples that come with the project.

    I build everything with the command mvn clean install -P bootstrap

    We will then open the project sparklr in the samples/oauth2 folder. This project will serve as both the Authorization Server and Resource Server from our previous definitions. Sparklr is a project that offers pictures secured with OAuth.

    We will also open the tonr application in the same directory. This is the OAuth client application.

    Both projects are built by the previous Maven build step. I took both war files and copied them to a Tomcat 7 webapps directory that I have on my computer, under the names sparklr2.war and tonr2.war. Then I started Tomcat and visited the address http://localhost:8080/tonr2/. I was greeted by the page in the following figure:

    If I click in the “sparklr pics” tab I am presented with a login screen particular to the tonr application (The Client in this case has a password of its own):

    I log in with the default username and password, and I am now redirected to the sparklr login screen.

    Remember this is the Authorization Server as well as the Resource Server, where the resources are pictures.

    Again I log in with the default username and password. This will ask for confirmation that I am allowing Tonr to access my personal resources on my behalf.

    I click on Authorize. I am redirected back to Tonr and now I can see my pictures stored on Sparklr.

    Ok, that is the functionality working. Now let’s see how it all happens under the hood.

    When we visit the URL http://localhost:8080/tonr2/sparklr/photos by clicking on the tab, the request will arrive at the filter OAuth2ClientContextFilter which is defined by the xml namespaced element <oauth:client id="oauth2ClientFilter" /> in the spring-servlet.xml of the tonr project, which is referenced in the <http> element at the top of said file. Like this:

    <http access-denied-page="/login.jsp?authorization_error=true" xmlns="http://www.springframework.org/schema/security">
        <intercept-url pattern="/sparklr/**" access="ROLE_USER" />
        <intercept-url pattern="/facebook/**" access="ROLE_USER" />
        <intercept-url pattern="/**" access="IS_AUTHENTICATED_ANONYMOUSLY" />
        <form-login authentication-failure-url="/login.jsp?authentication_error=true" default-target-url="/index.jsp"
            login-page="/login.jsp" login-processing-url="/login.do" />
        <logout logout-success-url="/index.jsp" logout-url="/logout.do" />
        <anonymous />
        <custom-filter ref="oauth2ClientFilter" after="EXCEPTION_TRANSLATION_FILTER" />
    </http>

    You can see that accessing anything under the URL /sparklr/** requires a user with role ROLE_USER. That means that when trying to access that URL before being logged in, an AccessDeniedException will be thrown. The AccessDeniedException will be caught by the OAuth2ClientContextFilter, which won't do anything at all with this exception but rethrow it to be handled by the standard exception handling process of Spring Security.

    The application is automatically redirected to the login.jsp.

    After we log in, still trying to reach the URL /sparklr/photos, the OAuth2ClientContextFilter is reached again. The request will get all the way through to the SparklrServiceImpl, which is the service in charge of reaching out to Sparklr to retrieve the photos. This class will use a RestOperations instance (in particular an instance of OAuth2RestTemplate) to try to retrieve the photos from the remote Sparklr service using the following URL: http://localhost:8080/sparklr2/photos?format=xml

    This is what the OAuth2RestTemplate does internally when trying to contact the given URL:

    It will first try to obtain the Access Token (in the form of an instance of OAuth2AccessToken) from the client context. In particular from the configured DefaultOAuth2ClientContext.

    As we still don't have any access token for Sparklr, there is nothing stored on the DefaultOAuth2ClientContext. So the access token will be null at this point.

    Next a new AccessTokenRequest will be obtained from the DefaultOAuth2ClientContext. The AccessTokenRequest and the DefaultOAuth2ClientContext are configured automatically by Spring at startup by the use of the namespaced bean definition <oauth:rest-template resource="sparklr" />. That definition will be parsed by the class RestTemplateBeanDefinitionParser.

    Next, as there is no access token on the context, an instance of AccessTokenProvider will be used to try and retrieve the token, in particular an instance of AccessTokenProviderChain. The AccessTokenProviderChain's obtainAccessToken method will be called with the AccessTokenRequest and an AuthorizationCodeResourceDetails instance. This instance is configured by the resource attribute in the <oauth:rest-template> bean, which references a <oauth:resource> element, like this:

    <oauth:resource id="sparklr" type="authorization_code" client-id="tonr" client-secret="secret"
        access-token-uri="${accessTokenUri}" user-authorization-uri="${userAuthorizationUri}" scope="read,write" />

    <oauth:rest-template resource="sparklr" />

    The <oauth:resource> element will be parsed by the class ResourceBeanDefinitionParser at startup. In this particular case this will create an instance of the class AuthorizationCodeResourceDetails. In this instance, preconfigured client-id and client-secret will be set. Also configured in this instance are the required URLs for authenticating and requesting the Access Token.

    The AccessTokenProviderChain’s obtainAccessToken method will go through the configured AccessTokenProvider instances to try and obtain the token. In the current example the only configured AccessTokenProvider is an instance of AuthorizationCodeAccessTokenProvider. This class will try to make a POST call to the URL http://localhost:8080/sparklr2/oauth/authorize to get the authorization code for this client/user.

    The call to the /sparklr2/oauth/authorize endpoint will arrive at the FilterSecurityInterceptor of the Sparklr application. The interceptor will notice that the requested URL requires a user with role ROLE_USER (in Sparklr) to be logged in in order to access the contents of this endpoint. This is enforced by the following chunk of XML in the spring-servlet.xml of the Sparklr application (in particular, look at the line that intercepts the URL "/oauth/**"):

    <http access-denied-page="/login.jsp?authorization_error=true" disable-url-rewriting="true"
        xmlns="http://www.springframework.org/schema/security">
        <intercept-url pattern="/oauth/**" access="ROLE_USER" />
        <intercept-url pattern="/**" access="IS_AUTHENTICATED_ANONYMOUSLY" />
        <form-login authentication-failure-url="/login.jsp?authentication_error=true" default-target-url="/index.jsp"
            login-page="/login.jsp" login-processing-url="/login.do" />
        <logout logout-success-url="/index.jsp" logout-url="/logout.do" />
        <anonymous />
    </http>

    When the interceptor realizes that the security constraint is not met, it will throw an AccessDeniedException. This exception will be picked up by the ExceptionTranslationFilter which, at the end of a couple more redirections, will redirect to the URL http://localhost:8080/sparklr2/login.jsp.

    This redirect will be received by the AuthorizationCodeAccessTokenProvider in the Tonr application, which will then throw the exception UserRedirectRequiredException. That exception will then be captured by the OAuth2ClientContextFilter, which will make the call to the redirect URL returned by Sparklr, adding the required extra parameters. Something like: http://localhost:8080/sparklr2/login.jsp?response_type=code&client_id=tonr&scope=read+write&state=dWby7l&redirect_uri=http%3A%2F%2Flocalhost%3A8080%2Ftonr2%2Fsparklr%2Fphotos

    We are presented now with the login form for Sparklr. When we click on the login button, the following happens:

    Username and password will be sent to the URL /sparklr2/login.do. This URL is configured to be managed by the UsernamePasswordAuthenticationFilter when the attribute login-processing-url="/login.do" is configured on the <form-login> element. The UsernamePasswordAuthenticationFilter will extract the username and password from the request and will call the ProviderManager implementation of AuthenticationManager.

    The AuthenticationManager will find the DaoAuthenticationProvider that is configured with in-memory user details thanks to the XML element shown next:

    <authentication-manager alias="authenticationManager" xmlns="http://www.springframework.org/schema/security">
        <authentication-provider>
            <user-service id="userDetailsService">
                <user name="marissa" password="koala" authorities="ROLE_USER" />
                <user name="paul" password="emu" authorities="ROLE_USER" />
            </user-service>
        </authentication-provider>
    </authentication-manager>

    The DaoAuthenticationProvider is then able to find the user marissa with the corresponding role ROLE_USER and authenticate her.

    After successfully authenticating in Sparklr, the framework redirects to the /sparklr2/oauth/authorize URL that was previously requested. This time access to the endpoint is granted for the user who has just logged in.

    The /oauth/authorize request in Sparklr is handled by the class org.springframework.security.oauth2.provider.endpoint.AuthorizationEndpoint, which comes with the core Spring Security OAuth project. This endpoint class is configured automatically when the following XML configuration is used:

    <oauth:authorization-server client-details-service-ref="clientDetails" token-services-ref="tokenServices"
            user-approval-handler-ref="userApprovalHandler">
        <oauth:authorization-code />
        <oauth:implicit />
        <oauth:refresh-token />
        <oauth:client-credentials />
        <oauth:password />
    </oauth:authorization-server>

    When that XML element (<oauth:authorization-server>) is used, it is parsed by the class AuthorizationServerBeanDefinitionParser, a complex parser that handles many different elements and options. Some of them we will see later, but for now we are only looking at the AuthorizationEndpoint.

    As just mentioned, the AuthorizationEndpoint maps the request /oauth/authorize. The first thing it does is check what kind of request the client is making, basically by looking at the request parameter response_type and checking whether its value is “code” or “token”, to know if an authorization code or an access token is being requested. For the current call, the value of this parameter is “code”, as you may remember.
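
    Conceptually, the decision looks like this (a sketch only, not the framework's source; request stands for the incoming HttpServletRequest):

    String responseType = request.getParameter("response_type");
    if ("code".equals(responseType)) {
        // authorization code grant: issue an authorization code (our case here)
    } else if ("token".equals(responseType)) {
        // implicit grant: issue the access token directly
    }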

    Next, the AuthorizationEndpoint creates an instance of org.springframework.security.oauth2.provider.AuthorizationRequest. The main part of building the AuthorizationRequest is using the clientId parameter to retrieve the client details of the connecting client. In this request, the clientId is “tonr” and the details are retrieved from an InMemoryClientDetailsService that is configured by the following XML element:

    <oauth:client-details-service id="clientDetails">
        <oauth:client client-id="my-trusted-client" authorized-grant-types="password,authorization_code,refresh_token,implicit"
                authorities="ROLE_CLIENT, ROLE_TRUSTED_CLIENT" scope="read,write,trust" access-token-validity="60" />
        <oauth:client client-id="my-trusted-client-with-secret" authorized-grant-types="password,authorization_code,refresh_token,implicit"
                secret="somesecret" authorities="ROLE_CLIENT, ROLE_TRUSTED_CLIENT" />
        <oauth:client client-id="my-client-with-secret" authorized-grant-types="client_credentials" authorities="ROLE_CLIENT"
                scope="read" secret="secret" />
        <oauth:client client-id="my-less-trusted-client" authorized-grant-types="authorization_code,implicit"
                authorities="ROLE_CLIENT" />
        <oauth:client client-id="my-less-trusted-autoapprove-client" authorized-grant-types="implicit"
                authorities="ROLE_CLIENT" />
        <oauth:client client-id="my-client-with-registered-redirect" authorized-grant-types="authorization_code,client_credentials"
                authorities="ROLE_CLIENT" redirect-uri="http://anywhere?key=value" scope="read,trust" />
        <oauth:client client-id="my-untrusted-client-with-registered-redirect" authorized-grant-types="authorization_code"
                authorities="ROLE_CLIENT" redirect-uri="http://anywhere" scope="read" />
        <oauth:client client-id="tonr" resource-ids="sparklr" authorized-grant-types="authorization_code,implicit"
                authorities="ROLE_CLIENT" scope="read,write" secret="secret" />
    </oauth:client-details-service>

    As you can see in the XML chunk, “tonr” is configured as a client. When we retrieve the details of the client, we get an instance of ClientDetails, whose source is shown next:

    package org.springframework.security.oauth2.provider;

    import java.io.Serializable;
    import java.util.Collection;
    import java.util.Map;
    import java.util.Set;

    import org.springframework.security.core.GrantedAuthority;

    /**
     * Client details for OAuth 2
     *
     * @author Ryan Heaton
     */
    public interface ClientDetails extends Serializable {

            /**
             * The client id.
             *
             * @return The client id.
             */
            String getClientId();

            /**
             * The resources that this client can access. Can be ignored by callers if empty.
             *
             * @return The resources of this client.
             */
            Set<String> getResourceIds();

            /**
             * Whether a secret is required to authenticate this client.
             *
             * @return Whether a secret is required to authenticate this client.
             */
            boolean isSecretRequired();

            /**
             * The client secret. Ignored if the {@link #isSecretRequired() secret isn't required}.
             *
             * @return The client secret.
             */
            String getClientSecret();

            /**
             * Whether this client is limited to a specific scope. If false, the scope of the authentication request will be
             * ignored.
             *
             * @return Whether this client is limited to a specific scope.
             */
            boolean isScoped();

            /**
             * The scope of this client. Empty if the client isn't scoped.
             *
             * @return The scope of this client.
             */
            Set<String> getScope();

            /**
             * The grant types for which this client is authorized.
             *
             * @return The grant types for which this client is authorized.
             */
            Set<String> getAuthorizedGrantTypes();

            /**
             * The pre-defined redirect URI for this client to use during the "authorization_code" access grant. See OAuth spec,
             * section 4.1.1.
             *
             * @return The pre-defined redirect URI for this client.
             */
            Set<String> getRegisteredRedirectUri();

            /**
             * Get the authorities that are granted to the OAuth client. Note that these are NOT the authorities that are
             * granted to the user with an authorized access token. Instead, these authorities are inherent to the client
             * itself.
             *
             * @return The authorities.
             */
            Collection<GrantedAuthority> getAuthorities();

            /**
             * The access token validity period for this client. Null if not set explicitly (implementations might use that fact
             * to provide a default value for instance).
             *
             * @return the access token validity period
             */
            Integer getAccessTokenValiditySeconds();

            /**
             * The refresh token validity period for this client. Zero or negative for default value set by token service.
             *
             * @return the refresh token validity period
             */
            Integer getRefreshTokenValiditySeconds();

            /**
             * Additional information for this client, not needed by the vanilla OAuth protocol but might be useful, for example,
             * for storing descriptive information.
             *
             * @return a map of additional information
             */
            Map<String, Object> getAdditionalInformation();
    }
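
    With the client-details configuration above, retrieving “tonr” programmatically boils down to something like this (a sketch; clientDetailsService stands for the InMemoryClientDetailsService that the XML registers):

    ClientDetails client = clientDetailsService.loadClientByClientId("tonr");
    client.getScope();                  // [read, write]
    client.getAuthorizedGrantTypes();   // [authorization_code, implicit]
    client.isSecretRequired();          // true, because secret="secret" is configured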

    After the AuthorizationRequest object is created, it is checked to see whether it has been approved. As it has just been created, it has not yet been approved, so the AuthorizationEndpoint forwards the request to forward:/oauth/confirm_access, which is configured internally by default in the framework.

    The confirm_access URL needs to be handled by the application we are using as the authorization server, which in this case is the Sparklr application. In Sparklr, the AccessConfirmationController is in charge of handling requests to the /oauth/confirm_access URL. It does so by showing the page access_confirmation.jsp, which contains the form for authorizing or denying the Tonr client access to the Sparklr server.

    Now when you click on Authorize, the following happens:

    A POST call is made to the URL /oauth/authorize in the Sparklr application. This call is handled by the class org.springframework.security.oauth2.provider.endpoint.AuthorizationEndpoint, in particular by the approveOrDeny method.

    The approveOrDeny method extracts the AuthorizationRequest that is requesting access. The AuthorizationRequest interface looks like this:

    package org.springframework.security.oauth2.provider;

    import java.util.Collection;
    import java.util.Map;
    import java.util.Set;

    import org.springframework.security.core.GrantedAuthority;

    /**
     * Base class representing a request for authorization. There are convenience methods for the well-known properties
     * required by the OAuth2 spec, and a set of generic authorizationParameters to allow for extensions.
     *
     * @author Ryan Heaton
     * @author Dave Syer
     * @author Amanda Anganes
     */
    public interface AuthorizationRequest {

            public static final String CLIENT_ID = "client_id";
            public static final String STATE = "state";
            public static final String SCOPE = "scope";
            public static final String REDIRECT_URI = "redirect_uri";
            public static final String RESPONSE_TYPE = "response_type";
            public static final String USER_OAUTH_APPROVAL = "user_oauth_approval";

            public Map<String, String> getAuthorizationParameters();

            public Map<String, String> getApprovalParameters();

            public String getClientId();

            public Set<String> getScope();

            public Set<String> getResourceIds();

            public Collection<GrantedAuthority> getAuthorities();

            public boolean isApproved();

            public boolean isDenied();

            public String getState();

            public String getRedirectUri();

            public Set<String> getResponseTypes();
    }

    The redirectUri is extracted and the approval process is evaluated again, this time approving the AuthorizationRequest.

    The request is also evaluated to make sure it is asking for the authorization code (by calling getResponseTypes on the AuthorizationRequest) and not the token. The framework then generates a new authorization code as requested, and the approveOrDeny method finishes its work by returning a new instance of RedirectView, which is handled as a redirection back to the Tonr application; in my current example, to the URL http://localhost:8080/tonr2/sparklr/photos;jsessionid=03B2E814391E010B3D1210241ECF6C0A?code=vqMbuf&state=aTSlVl.
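
    Using only the methods of the AuthorizationRequest interface shown above, the checks described here can be sketched as follows (an illustration, not the literal framework source):

    if (!authorizationRequest.isApproved()) {
        // the user denied access: redirect back to the client with an error
    } else if (authorizationRequest.getResponseTypes().contains("code")) {
        // generate a fresh authorization code and append it, together with the
        // original state parameter, to authorizationRequest.getRedirectUri(),
        // then return a RedirectView pointing back to the Tonr application
    }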

    The redirect arrives back at Tonr. Tonr now processes the URL again, but this time it tries to obtain the access token in order to continue. As before, Tonr uses the OAuth2RestTemplate in combination with the AuthorizationCodeAccessTokenProvider to try to retrieve the access token from Sparklr. In particular, it makes a request to the URL /sparklr/oauth/token with the parameters {grant_type=’authorization_code’, redirect_uri=’http://localhost:8080/tonr2/sparklr/photos’, code=xxxx}.
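
    At the HTTP level, the token request that the AuthorizationCodeAccessTokenProvider assembles is roughly the following (a sketch; the provider builds and sends this form for you, and authorizationCode stands for the code received in the redirect above):

    import org.springframework.util.LinkedMultiValueMap;
    import org.springframework.util.MultiValueMap;

    MultiValueMap<String, String> form = new LinkedMultiValueMap<String, String>();
    form.add("grant_type", "authorization_code");
    form.add("redirect_uri", "http://localhost:8080/tonr2/sparklr/photos");
    form.add("code", authorizationCode);
    // The POST to Sparklr's /oauth/token endpoint also carries the client
    // credentials (tonr/secret), typically as HTTP Basic authentication.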

    That call to Sparklr will be handled by the class org.springframework.security.oauth2.provider.endpoint.TokenEndpoint. This endpoint will create a new instance of OAuth2AccessToken and generate a response with that Bearer token.

    Tonr receives this response and tries the call to Sparklr again, this time passing the Bearer access token received in the previous step as part of the Authorization header of the request.
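
    Stripped of the framework machinery, that final call is just an HTTP request with an Authorization header, along these lines (a sketch; accessToken and the photos URL are stand-ins for the real values):

    import org.springframework.http.HttpEntity;
    import org.springframework.http.HttpHeaders;
    import org.springframework.http.HttpMethod;
    import org.springframework.http.ResponseEntity;
    import org.springframework.web.client.RestTemplate;

    HttpHeaders headers = new HttpHeaders();
    headers.set("Authorization", "Bearer " + accessToken);
    ResponseEntity<String> photos = new RestTemplate().exchange(
            "http://localhost:8080/sparklr2/photos?format=json",
            HttpMethod.GET, new HttpEntity<Void>(headers), String.class);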

    That is all Sparklr needs to verify the request. The OAuth2AuthenticationProcessingFilter extracts the token from the request and, with the help of the OAuth2AuthenticationManager, validates and authenticates it.

    Once the token is verified, the call finally arrives at the PhotoController, which is in charge of serving the photos. Later in the week I will add a couple of UML diagrams to help explain the relationships between the major classes involved in the Spring Security OAuth project.

    To understand the internal details of how the core of the Spring Security framework works, you can check out the book