Read the first part in my Data Mining series.
Scary words: Extracting Characterizing Features and Locality Sensitive Hashing, Signature Matrix
There. Now that's out of our way.
Scenario: There is a big bunch of objects (documents, pictures, videos) that you want to compare. You want all objects which are more than 85% similar (we will get to similarity in another post; for the moment just accept my non-existent expertise, dammit!).
Assuming we can extract some features from these objects which characterize them, we can collect the objects (e.g. documents) which are more than 85% similar.
As long as you have these features for the objects, it doesn't matter what the objects are. They could be pictures of salads, or mp3-files or documents.
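To make the scenario concrete, here is a minimal sketch, not a definitive implementation. It assumes every object has already been boiled down to a set of features, and it uses plain set overlap (intersection over union) as a stand-in for "similar", since similarity proper is a topic of its own. The names `overlap` and `similar_pairs` are made up for this illustration.

```python
from itertools import combinations


def overlap(a, b):
    """Share of features two objects have in common (intersection over union)."""
    if not a and not b:
        return 1.0  # assumption: two featureless objects count as identical
    return len(a & b) / len(a | b)


def similar_pairs(features, threshold=0.85):
    """Return the name pairs whose feature sets are more than `threshold` similar.

    `features` maps an object name to its set of features. The objects
    themselves never show up here, only their features.
    """
    return [
        (x, y)
        for x, y in combinations(sorted(features), 2)
        if overlap(features[x], features[y]) > threshold
    ]
```

Note that this compares every object with every other object, which gets expensive fast once the bunch is really big.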
Extracting Characterizing Features
This is almost a field of its own, but for documents there is a good method: shingles (which were covered in the first part).
We don't want to use character occurrence counts, because that doesn't take words or word order into account.
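A quick way to see the problem with counting characters, as a small sketch (the two example sentences and the helper name `char_counts` are made up for illustration): two sentences with opposite meanings can have exactly the same character counts.

```python
from collections import Counter


def char_counts(text):
    """Count how often each character occurs, ignoring where it occurs."""
    return Counter(text)


a = "dog bites man"
b = "man bites dog"

# Same characters, same counts, different meaning: counting alone
# cannot tell these two sentences apart.
same_counts = char_counts(a) == char_counts(b)
```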
What about the length of the shingles? Well, there is no magic number. But "not too long and not too short" is the most specific answer that can be given (translates into 5-14 characters... maybe). It really depends.
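As a rough sketch of what the shingle length does, assuming character-level shingles collected into a set (the function name `shingles` and the parameter name `k` are choices made for this example, not something fixed by the method):

```python
def shingles(text, k):
    """Return the set of all k-character substrings (shingles) of `text`."""
    if k <= 0:
        raise ValueError("k must be positive")
    # A text shorter than k has no shingles, so the result is an empty set.
    return {text[i:i + k] for i in range(len(text) - k + 1)}


a = "dog bites man"
b = "man bites dog"

# Shingles the two sentences have in common for one choice of k.
shared = shingles(a, 5) & shingles(b, 5)
```

With a very small `k` almost everything looks alike (at `k=1` you are back to comparing which characters occur), and with a very large `k` only near-identical texts share anything. Try a few values on your own data before settling on one.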
Jaccard Similarity (...)