Read the first part in my Data Mining series.

Scary words: *Extracting Characterizing Features and Locality Sensitive Hashing, Signature Matrix*

There. Now that's out of our way.

**Scenario:** There is a big bunch of objects (documents, pictures, videos) that you want to compare. You want all objects which are more that 85% similar (We will get to similarity in another post. For the moment just accept my non-existent expertise, dammit!).

Assumming we can extract some features from these objects which characterizes them, we can collect the objects (e.g documents), which are 85% similar.

As long as you have these features for the objects, it doesn't matter *what* the objects are. They could be pictures of salads, or mp3-files or documents.

### Extracting Characterizing Features

This is almost a field of it's own, but for documents there is a good method: Shingles. (Which was covered in the First Part)

We don't want to use number character occurrances, because that doesnt take words or word order into account.

What about the length of the shingles? Well, there is no magic number. But "not to long and not to short" is the most specific answer that can be given (translates into 5-14 characters...maybe). It really depends.

### Jacobian Similarity

Jacobian Similarity is really simple. In mathematical terms it is simply the size of theintersection of A and B divided by the size of the union of A and B. Or:

**Example**: Given the set `a = {1,2,3}`

and `b = {1,2}`

, we get that the size of the intersection `{1,2}`

is `2`

, and the size of the union `{1,2,3}`

is `3`

, so the Jacobian Similarity becomes `J(a,b) = 2/3`

. Which makes sense, since we we can almost intuitively see that they are ⅔ "alike".

This is how we're going to compare our objects (be it documents or videos). We're going to have some set of shingles representing each object, then we're going to apply the Jacobian similarity to each of them and choose the ones which are over 85% duplicates (as stated as our goal in the "scenario").

### Shingles Revisited

Given a set of objects, e.g. `{A,B,C}`

which have extracted some shingles from and the totla number of all the shingles are, say, 6.

**Remember:** A shingle is some type of feature characterizing an object. It could be that, for example if the objects are picture, that then a feature could be that the picture contains the colour red somewhere. Or it contains a boat. Or a person. Or something. Just something that makes this picture unique (in some sense).

Not all of the object will have *all* shingles, but everyone of the objects will have some shingles (without loss of generality). So if we wanted to represent an object, say `A`

, we could do that with an array: `[0,0,0,1,0,0]`

. This would mean that `A`

has shingle/feature number 4 (maybe `A`

is a picture containing a boat and shingle/feature #4 represents a boat).

### Locality Sensitive Hashing, or: "You Have Many Objects"

Indeed. So we're not going to use Jacobian Similarity. ... Wait, what?

Yes. Jacob is too slow! We are going to use something more efficient but less accurate; enter: Locality Sensitive Hashing (LSH).

This is a really fancy name. The concept is a bit convoluted, but it is simple stuff convoluted with simple stuff. So it's "easy" to understand and implement.

**NOTE**: *If we would have a small set of objects* we would NOT use LSH. It is used purely because comparing millions of objects explicity using Jacobian Similarity is *really* expensive time (and energy) wise.

The LSH Idea: Create some sort of function which puts some objects into one "bucket" and some other objects into another "bucket" based on the features/shingles. This would then lead to that similar objects are grouped together.

The Method: Min-Hashing! (Yes yes i know! Min-Hashing was not in the scary word list, but it is not scary; read on...)

Min-Hashing means that we count the number of zeros in the array until we get to a one.

**Example**: Given `[0,0,1,0,1,0]`

, we would get `3`

because the first `1`

appears at the third position.

Okay, there is one small thing more about Min-Hashing: We will permutate the array randomly. (All object's arrays will be permutated exactly in the same way).

**Example**: So the array `[0,0,1,0,1,0]`

form last example would become (maybe, it is random after all): `[0,1,1,0,0,0]`

and so the min-hash would be `2`

.

So if we did this to 4 different object, we would maybe get the result `[3,2,1]`

where each number in the array represent the min-hash of that object. So: `[3,2,1]`

would mean that object number #1 has at least feature number #3 (but not feature #1 and #2), and object number #2 has at least feature number #2 (but not feature #1), and object number #3 has at least feature number #3. (The feature "numbers" or feature indexes are permutated, but since thay have been permutated the same for all objects when comparing it doesnt matter)

**Example**: Given three (3) objects, six (6) possible features, and a permutation `[5,2,4,6,3,1]`

, we could get the "result" `[3,2,1]`

:

So, an explanation of this picture:

The number `3`

in the right hand side `[3,2,1]`

array comes from that the first column in the big matrix has it's first `1`

at position three counting according to the small (permutation) matrix. And so on.

#### Cool, i guess... But what are you on about?

Min-Hashing has the nice property that the probability of two objects getting the same min-hash is the same as the Jacobian similarity of the two objects. Or:

Which is nice. But it has a flaw: given that it is a *probability*, we will miss 15% if we want 85% similarity. (Since the probability of two objects being in the same "bucket" is 85% (given that this is our threshold)).

To counter this: Run the algorithm multiple times using different permutations! It is relatively cheap to run and the more you run it the better the results. (there is mathematical theory behind this which im not going to bother putting here).

So it would result in:

Here we only run it two times, but this is just to show how the result expands when run many times. The matrix on the right hand side is called a *Signature Matrix*.

Repeating the min-hasing will make the CDF (Cumulative Distribution Function) of the probability of putting two similar objects into the same "bucket" approximate the step-function. Which is very nice.

This was a LOT of theory. Stay tuned for part three where all of this will be applied and implemented as a MapReduce program in Python!

Cheers :)