Process your massive word2vec binary model file as a readable stream of records.
Word2vec models are typically distributed as massive binary files (for instance, the standard GoogleNews set is several gigabytes once unzipped). In some cases, you may wish to process these models and persist all or part of their contents to a database or other store, without paying the considerable memory cost of reading the whole file in at once.
This tiny library is merely a handy function that parses the binary format and offers a readable stream of objects containing the word and the value (vector array).
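To make the binary format concrete, here is a minimal sketch that parses a tiny in-memory model. It assumes the commonly used layout: an ASCII header line `<vocabSize> <dimensions>`, then for each record the word, a space, and one little-endian 32-bit float per dimension, optionally followed by a newline. `parseWord2vecBuffer` and `encodeRecord` are hypothetical helpers written for this illustration; they are not part of this library, and unlike the library they read everything into memory.

```javascript
// Hypothetical helper: parse a whole word2vec binary model held in a Buffer.
// Assumes little-endian float32 values, as written on typical x86 machines.
function parseWord2vecBuffer(buf) {
  // Header line: "<vocabSize> <dimensions>\n" in ASCII.
  const headerEnd = buf.indexOf(0x0a);
  const [vocabSize, dims] = buf
    .toString('ascii', 0, headerEnd)
    .split(' ')
    .map(Number);

  const records = [];
  let pos = headerEnd + 1;
  for (let i = 0; i < vocabSize; i++) {
    // Some files put a newline between records; skip it if present.
    while (buf[pos] === 0x0a) pos++;
    // The word runs up to the next space.
    const wordEnd = buf.indexOf(0x20, pos);
    const word = buf.toString('utf8', pos, wordEnd);
    pos = wordEnd + 1;
    // Then come `dims` floats of 4 bytes each.
    const values = [];
    for (let d = 0; d < dims; d++) {
      values.push(buf.readFloatLE(pos));
      pos += 4;
    }
    records.push({ word, values });
  }
  return records;
}

// Hypothetical helper: encode one record so the sketch needs no model file.
function encodeRecord(word, values) {
  const floats = Buffer.alloc(values.length * 4);
  values.forEach((v, i) => floats.writeFloatLE(v, i * 4));
  return Buffer.concat([Buffer.from(word + ' ', 'utf8'), floats, Buffer.from('\n')]);
}

// A two-word, three-dimensional sample model.
const sample = Buffer.concat([
  Buffer.from('2 3\n', 'ascii'),
  encodeRecord('runs', [0.5, -0.25, 1]),
  encodeRecord('walks', [0, 2, -1.5]),
]);

console.log(parseWord2vecBuffer(sample));
```

A streaming parser has to do the same work incrementally, carrying over any partial record that straddles two chunks of the file, which is the part this sketch leaves out.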
The function exported by word2vec-stream returns a promise, which resolves to a readable stream:
const word2vecStream = require('word2vec-stream');
A single word object looks like this:
{
  word: 'runs',
  values: [
    -0.03380169719457626,
    0.05194384977221489,
    -0.03704818710684776,
    0.016614392399787903,
    0.0660756304860115,
    0.030364234000444412,
    -0.028072593733668327,
    -0.16270646452903748,
    -0.038575947284698486,
    0.12756797671318054,
    // ... as many floats as vector dimensions here
  ]
}
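As a rough sketch of how records of this shape might be persisted without holding the whole model in memory, the example below pipes an object-mode stream into a writer that saves records in batches. Everything named here is an assumption made for illustration: `fakeWordStream` stands in for the stream this library resolves to (so the sketch runs without a model file), and `batchWriter` and its `saveBatch` callback are hypothetical, standing in for whatever database call you use.

```javascript
const { Readable, Writable } = require('stream');

// Stand-in source: emits { word, values } objects, then ends.
function fakeWordStream(records) {
  const queue = records.slice();
  return new Readable({
    objectMode: true,
    read() {
      this.push(queue.length ? queue.shift() : null);
    },
  });
}

// Hypothetical sink: collects records and hands them to saveBatch in groups.
// The write callback is delayed until saveBatch settles, so the source is
// paused by backpressure while a batch is being saved.
function batchWriter(batchSize, saveBatch) {
  let batch = [];
  return new Writable({
    objectMode: true,
    write(record, _encoding, callback) {
      batch.push(record);
      if (batch.length < batchSize) return callback();
      const full = batch;
      batch = [];
      Promise.resolve(saveBatch(full)).then(() => callback(), callback);
    },
    // Flush whatever is left when the source ends.
    final(callback) {
      if (batch.length === 0) return callback();
      Promise.resolve(saveBatch(batch)).then(() => callback(), callback);
    },
  });
}

const saved = [];
const done = new Promise((resolve, reject) => {
  fakeWordStream([
    { word: 'runs', values: [0.5, -0.25] },
    { word: 'walks', values: [0, 2] },
    { word: 'sits', values: [1, 1] },
  ])
    .pipe(batchWriter(2, batch => { saved.push(batch.map(r => r.word)); }))
    .on('finish', resolve)
    .on('error', reject);
});

done.then(() => console.log(saved));
```

With a real model, only one batch of vectors needs to be in memory at a time; the batch size is a trade-off between memory use and the number of round trips to the store.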
For a quick example, examine and run the demo.js file, which dumps records to the console. The included tests also demonstrate basic invocation.
$ node demo.js
This library targets Node.js v8, though it may work with somewhat earlier versions; some necessary parts of the stream.Readable API may not be supported in older versions.
Thanks to node-word2vec for illustrating the basics of parsing the binary format in Node.