_ _ _ _
_ __ ___ (_) _ __ | |__ __ _ ___ | |__ (_) ___
| '_ ` _ \ | | | '_ \ | '_ \ / _` | / __| | '_ \ | | / __|
| | | | | | | | | | | | | | | | | (_| | \__ \ | | | | _ | | \__ \
|_| |_| |_| |_| |_| |_| |_| |_| \__,_| |___/ |_| |_| (_) _/ | |___/
|__/
Minhashing is an efficient similarity estimation technique that is often used to identify near-duplicate documents in large text collections. This package offers a JavaScript implementation of the minhash algorithm and an efficient Locality Sensitive Hashing Index for finding similar minhashes in Node.js or web applications.
Installation
To get started with Minhash.js, you can install the package with npm:
npm install minhash --save
If you prefer, you can instead load the package directly in a browser:
Minhash Usage
Minhashes are hash representations of the contents within a set. The following example minhashes and then estimates the Jaccard similarity between two sets:
; // If using Node.js var s1 = 'minhash' 'is' 'a' 'probabilistic' 'data' 'structure' 'for' 'estimating' 'the' 'similarity' 'between' 'datasets';var s2 = 'minhash' 'is' 'a' 'probability' 'data' 'structure' 'for' 'estimating' 'the' 'similarity' 'between' 'documents'; // create a hash for each set of words to comparevar m1 = ;var m2 = ; // update each hashs1;s2; // estimate the jaccard similarity between two minhashesm1;
LshIndex Usage
While one can compare the Jaccard similarity between a minhash and all others in a collection, the complexity of doing so is O(n), as one needs to compare the query set to every other set.
To estimate the results of the same comparison in sub-linear time, one can instead build a Locality Sensitive Hash Index, which maps hash sequences from a minhash signature to the list of document identifiers that contain the given hash sequence. Using this indexing technique, one can effectively find sets similar to a query set:
; // If using Node.js var s1 = 'minhash' 'is' 'a' 'probabilistic' 'data' 'structure' 'for' 'estimating' 'the' 'similarity' 'between' 'datasets';var s2 = 'minhash' 'is' 'a' 'probability' 'data' 'structure' 'for' 'estimating' 'the' 'similarity' 'between' 'documents';var s3 = 'cats' 'are' 'tall' 'and' 'have' 'been' 'known' 'to' 'sing' 'quite' 'loudly'; // generate a hash for each list of wordsvar m1 = ;var m2 = ;var m3 = ; // update each hashs1;s2;s3; // add each document to a Locality Sensitive Hashing indexvar index = ;index;index;index; // query for documents that appear similar to a query documentvar matches = index;console;
Example
The sample application uses minhash.js to compute the similarity between several sample documents:
There is also a sample Node.js script that can be run with node examples/index.js
.
Development
To run the test suite — npm run test
.
To compile and minify minhash.min.js — npm run build
.