warc
Parse WARC (Web Archive Files) as a node.js stream
This stream parses the Web Archive file format as used by the Common Crawl project.
NB: This stream doesn't do any gzip decompression; it assumes a decompressed WARC file. The WARC files used by Common Crawl are actually multi-part gzip files, and a bug in the zlib library, present at the time of writing (node 0.10.32), means that only the first gzipped chunk is processed.
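To check whether a given node version handles multi-part gzip data, a small test along these lines may help. It is only a sketch: the sample payloads are made up, and it relies solely on node's built-in zlib module, not on anything provided by warc. Later node releases document support for concatenated gzip members, so the outcome depends on the runtime you use.

```javascript
var zlib = require('zlib');

// Two independently gzipped members, concatenated back to back, which is
// how a multi-part gzip file is laid out. The payloads are illustrative.
var multiPart = Buffer.concat([
  zlib.gzipSync('first record\n'),
  zlib.gzipSync('second record\n')
]);

// Decompress the whole buffer in one call and inspect what comes back.
var decompressed = zlib.gunzipSync(multiPart).toString('utf8');
console.log(JSON.stringify(decompressed));
```

If only the first payload is printed, the runtime stops after the first gzipped chunk and the files need to be decompressed by other means before being piped into this stream.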
Installation
This module is installed via npm:
$ npm install warc
Example Usage
Assumes an uncompressed WARC stream. The content field will be returned as a node Buffer.
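var WARCStream = require('warc'),
    fs = require('fs');

var w = new WARCStream();
// The path below is a placeholder; point it at an uncompressed WARC file.
fs.createReadStream('path/to/file.warc')
  .pipe(w)
  .on('data', function (data) {
    console.log(data);
  });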
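For readers unfamiliar with the format, the following sketch shows roughly what a single uncompressed WARC record looks like and why the content is best kept as a Buffer. It is not this module's implementation: parseWarcRecord is a hypothetical helper, the protocol and headers field names are assumptions chosen for illustration (only content is mentioned above), and the sample record is invented.

```javascript
// Hypothetical helper, NOT part of the warc module's API.
// Parses one complete, uncompressed WARC record held in a Buffer.
function parseWarcRecord(buf) {
  // A blank line (CRLF CRLF) separates the header block from the content.
  var sep = buf.indexOf('\r\n\r\n');
  if (sep === -1) throw new Error('incomplete WARC header');

  var lines = buf.slice(0, sep).toString('utf8').split('\r\n');
  var record = { protocol: lines[0], headers: {} };

  // Remaining header lines are "Name: value" pairs.
  lines.slice(1).forEach(function (line) {
    var i = line.indexOf(':');
    if (i > 0) {
      record.headers[line.slice(0, i).trim()] = line.slice(i + 1).trim();
    }
  });

  // Content-Length gives the size of the content block in bytes.
  var length = parseInt(record.headers['Content-Length'], 10);
  var start = sep + 4;
  record.content = buf.slice(start, start + length); // stays a Buffer
  return record;
}

// Invented sample record for illustration.
var sample = Buffer.from(
  'WARC/1.0\r\n' +
  'WARC-Type: response\r\n' +
  'Content-Length: 5\r\n' +
  '\r\n' +
  'hello' +
  '\r\n\r\n'
);

var rec = parseWarcRecord(sample);
console.log(rec.protocol, rec.headers, rec.content);
```

Keeping the content as a Buffer avoids guessing a character encoding, since archived payloads can be binary as well as text.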