Parse WARC (Web Archive Files) as a node.js stream
This stream parses the Web Archive file format as used by the Common Crawl project.
NB: this stream doesn't do any gzip decompression; it assumes an
uncompressed WARC file. The WARC files used by Common Crawl are actually
multi-part gzip files, and as of the time of writing there is a significant
bug in node's zlib library: it will only process the first gzipped chunk.
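To make "multi-part gzip" concrete, here is a minimal sketch using only node's built-in zlib module. The record strings are made up for illustration. It builds a two-member gzip buffer and inflates each member separately, which is only possible here because the member lengths are known in advance; a real Common Crawl file does not come with those offsets.

```javascript
var zlib = require('zlib');

// Two made-up records, each compressed as its own gzip member.
var records = ['first record\n', 'second record\n'];
var members = records.map(function (text) {
  return zlib.gzipSync(Buffer.from(text));
});

// A multi-member gzip file is simply the members laid end to end.
var multiMember = Buffer.concat(members);

// Inflate each member on its own, using the lengths recorded above.
var offset = 0;
var decoded = members.map(function (member) {
  var slice = multiMember.subarray(offset, offset + member.length);
  offset += member.length;
  return zlib.gunzipSync(slice).toString();
});

console.log(decoded);
```

A decompressor that stops after the first member would return only the first record from `multiMember`, which is the failure described above.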
This module is installed via npm:
$ npm install warc
Assumes an uncompressed WARC stream. The
content field will be returned as a
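var WARCStream = require('warc'),
    fs = require('fs');

// Reconstructed from a garbled snippet: the constructor call, the
// placeholder file path and the 'data' handler are assumptions.
var w = new WARCStream();
fs.createReadStream('./example.warc').pipe(w).on('data', console.log);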
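For orientation, an uncompressed WARC record is a version line, a block of `Name: value` header lines, a blank line, and then `Content-Length` bytes of content. The sketch below parses one such record from a Buffer. It is not this module's implementation: `parseWarcRecord` is a hypothetical helper, the sample record is trimmed to two headers, and returning the content as a Buffer is this sketch's own choice.

```javascript
// Parse one uncompressed WARC record from the start of a Buffer.
// Returns null if the buffer does not yet hold a complete record.
function parseWarcRecord(buf) {
  var headerEnd = buf.indexOf('\r\n\r\n');
  if (headerEnd === -1) return null; // header block not complete yet

  var lines = buf.toString('utf8', 0, headerEnd).split('\r\n');
  var version = lines.shift(); // e.g. "WARC/1.0"
  var headers = {};
  lines.forEach(function (line) {
    var colon = line.indexOf(':');
    if (colon === -1) return; // skip anything that is not "Name: value"
    headers[line.slice(0, colon).trim()] = line.slice(colon + 1).trim();
  });

  var length = parseInt(headers['Content-Length'], 10);
  var start = headerEnd + 4; // skip the blank line after the headers
  if (isNaN(length) || buf.length < start + length) return null;

  return {
    version: version,
    headers: headers,
    content: buf.subarray(start, start + length)
  };
}

// A made-up sample record; real records carry more headers.
var sample = Buffer.from(
  'WARC/1.0\r\n' +
  'WARC-Type: resource\r\n' +
  'Content-Length: 5\r\n' +
  '\r\n' +
  'hello' +
  '\r\n\r\n'
);

var record = parseWarcRecord(sample);
console.log(record.version, record.headers['WARC-Type'], record.content.toString());
```

A streaming parser has to do the same work across chunk boundaries, buffering input until a full header block and content body are available.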