Data Mining 101: MapReduce

I am currently taking a course in Data Mining in which I have been introduced to a pretty cool model for distributed parallel computing: MapReduce.

It basically provides an environment in which you (as a programmer) only have to worry about two functions: Map and Reduce. It may sound a bit abstract and complex right now, but in reality it is simple.

The first step is Map, which takes a list of key-value pairs and outputs a list of key-value pairs. There are no constraints on the form of the keys and values, and no constraint on the relationship between the input and the output.

The second is the Reduce step, which takes a list of key-value pairs and outputs some list as a result. Here there is a constraint on the input: the value in each key-value pair must be a list.

There is a "group by key and sort" function in between these steps that the MapReduce environment takes care of.

More concisely:

  • Read a lot of data
  • Map: Extract something we care about
  • Shuffle & Sort
  • Reduce: Aggregate, Summarize, Filter or Transform
  • Write Result

And more formally:

  • map(k1,v1) -> list(k2, v2)
  • reduce(k2, list(v2)) -> list(v3)

There are some subtleties here that do matter. Notice how the input to reduce has the same key, k2, as the output from map, and that the value is a list consisting of all the values that were emitted with the key k2. This grouping is a key concept of MapReduce.
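
To make the "group by key and sort" step concrete, here is a minimal sketch in Python (my own illustration; group_by_key is a made-up name, not part of any real MapReduce framework) of what the environment does between Map and Reduce:

from itertools import groupby
from operator import itemgetter

def group_by_key(pairs):
    # Turn list((k2, v2)) into list((k2, list(v2))), which is what Reduce expects.
    pairs = sorted(pairs, key=itemgetter(0))
    return [(key, [value for _, value in group])
            for key, group in groupby(pairs, key=itemgetter(0))]

print(group_by_key([("the", 1), ("fox", 1), ("the", 1), ("the", 1)]))
# [('fox', [1]), ('the', [1, 1, 1])]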

Basic Example: Word Counting

A simple way to understand MapReduce is to look at a really simple application: counting words.

Fig 1: Concept of counting words using MapReduce

If you study this diagram, you will notice that the output from the map function contains (the, 1) three times, instead of (the, 3).

The output from the map-function then goes into the group & sort step, which the environment takes care of, and the output from that is given to the reduce-function, which does the actual counting and outputs the correct result.
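
As a sketch in Python (my own rendering; emit is a stub standing in for whatever output channel a real framework provides), the two functions for word counting could look like this:

def emit(key, value):
    # Stand-in for the framework's output mechanism.
    print(key, value)

def mapper(key, value):
    # key: the document name, value: the document text
    for word in value.split():
        emit(word, 1)           # one (word, 1) pair per occurrence

def reducer(key, values):
    # key: a word, values: the list of 1s grouped under that word
    emit(key, sum(values))      # the total count for that word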

Now, you may be thinking that this is a perfect example of over-architecting. And you would be right, if the subtleties of the method were ignored. There are some properties of the word-counting method that are easily overlooked:

1. Map is Sequential

Every read and every count produced from the map function (in MapReduce terms, you say that the function emits values) can be done sequentially.

2. Reduce is Sequential

Since the input to the reduce-function is sorted on the words, the number of occurrences of each word can be counted sequentially, e.g.:

count = 0
last_key = None
for key, value in input:          # input is sorted by key
    if key != last_key:
        if last_key is not None:
            emit(last_key, count) # key changed: emit the finished count
        last_key, count = key, 0
    count += value
emit(last_key, count)             # emit the count for the final key

3. This implies that distributed parallelization is feasible

Imagine that you split the original text into two (2) parts and send the two parts to two different map-functions. You then wait for the results from both map-functions and sort the combined output. Then you split this result into two parts, on a key boundary so that all pairs for a given word go to the same reduce-function, and send each part to its own reduce-function. Again you wait for both to finish, and when they have, you present the result by combining the outputs. You would get the correct word count!

This makes counting word occurrences in billions of documents feasible, since you could distribute small parts of each document to many map- and reduce-functions.
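
Here is a toy simulation of exactly that two-way split, in plain Python (no real cluster; mapper and reducer are names of my own):

from itertools import groupby
from operator import itemgetter

def mapper(words):
    return [(word, 1) for word in words]

def reducer(key, values):
    return (key, sum(values))

words = "the quick brown fox jumps over the lazy dog".split()
part1, part2 = words[:5], words[5:]            # split the input in two

pairs = mapper(part1) + mapper(part2)          # wait for both map results

pairs.sort(key=itemgetter(0))                  # group & sort the combined output
grouped = [(k, [v for _, v in g]) for k, g in groupby(pairs, key=itemgetter(0))]

half = len(grouped) // 2                       # each reducer gets its own keys
counts = [reducer(k, vs) for k, vs in grouped[:half]] \
       + [reducer(k, vs) for k, vs in grouped[half:]]
print(counts)   # ('the', 2) is correct even though 'the' hit both mappers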

Moving On: Counting Shingles

If you want to compare a set of documents and find which ones are "similar", word counting may not be the best metric of similarity. Another option is to use shingles. A k-shingle is a k-character-long substring of the document. We then use a set (i.e. not a bag, so no duplicate elements) of all the shingles that occurred in a document to represent that document.

Say we have "abcdbcd"; then the set of 2-shingles representing this string would be {ab, bc, cd, db}. Notice that bc and cd occur only once in the set of shingles, even though they appear twice in the original string.
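
A sketch of that in Python (the function name is mine):

def shingles(text, k=2):
    # The set of all k-character substrings of text; duplicates collapse.
    return {text[i:i + k] for i in range(len(text) - k + 1)}

print(shingles("abcdbcd"))   # {'ab', 'bc', 'cd', 'db'} (set order may vary)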

Ok, carry on...

So I am saying that using sets of k-shingles is better than counting word occurrences. This is why:

  1. Two documents can share a lot of words but have completely different subjects
  2. Shingles take word sequences into account

So what Map-function and what Reduce-function should you use to count shingles?

Well, it is fundamentally the same.

  • Map: Emit (k-shingle, 1) for every shingle in the document (sketched below)
  • Reduce: Sum the counts and present the result
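
In the same sketchy Python as before (emit again a stub for the framework's output channel):

def emit(key, value):
    print(key, value)           # stand-in for the framework's output mechanism

def mapper(doc_id, text, k=2):
    # One (shingle, 1) pair for every position in the document.
    for i in range(len(text) - k + 1):
        emit(text[i:i + k], 1)

def reducer(shingle, counts):
    emit(shingle, sum(counts))  # total occurrences across all documents

mapper("doc1", "abcdbcd")       # emits ab, bc, cd, db, bc, cd, each with a 1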

What's next?

If you didn't understand something or if I did something wrong (which I probably did), just add a comment.

Otherwise, I will post a Data Mining part 2 describing how one can compare documents using shingles and how to implement it in MapReduce.
