node-warc
Parse Web Archive (WARC) files or create WARC files using chrome-remote-interface, chrome-remote-interface-extra, or Puppeteer.
Run npm install node-warc or yarn add node-warc to get started.
Documentation
Full documentation available at n0tan3rd.github.io/node-warc
Parsing
Using async iteration
Requires node 10 or greater
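    const fs = require('fs')
    const zlib = require('zlib')
    // recordIterator only exported if async iteration on readable streams is available
    const { recordIterator } = require('node-warc')

    async function iterateRecords (warcStream) {
      for await (const record of recordIterator(warcStream)) {
        console.log(record)
      }
    }

    // gunzip the file stream before handing it to recordIterator
    iterateRecords(
      fs.createReadStream('<path-to-gzipd-warcfile>').pipe(zlib.createGunzip())
    ).then(() => {
      console.log('done')
    })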
Or using one of the parsers
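    // inside an async function, with AutoWARCParser required from node-warc
    for await (const record of new AutoWARCParser('<path-to-warcfile>')) {
      console.log(record)
    }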
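To make concrete what a parser has to pull out of each record, here is a minimal sketch that reads the header block of one uncompressed WARC record by hand. The function name parseWarcHeader and the trimmed-down sample record are hypothetical and exist only for illustration; this is not how node-warc is implemented, and the objects node-warc yields may be shaped differently.

```javascript
// Hypothetical helper, NOT part of node-warc: parse the header block of a
// single uncompressed WARC record held in a string.
function parseWarcHeader (rawRecord) {
  // The header block ends at the first blank line (CRLF CRLF).
  const headerEnd = rawRecord.indexOf('\r\n\r\n')
  if (headerEnd === -1) throw new Error('no end of WARC header found')
  const lines = rawRecord.slice(0, headerEnd).split('\r\n')
  // The first line is the version, e.g. "WARC/1.0".
  const version = lines[0]
  if (!version.startsWith('WARC/')) throw new Error('not a WARC record')
  const fields = {}
  for (const line of lines.slice(1)) {
    const sep = line.indexOf(':')
    if (sep === -1) continue // this sketch simply skips malformed lines
    fields[line.slice(0, sep).trim()] = line.slice(sep + 1).trim()
  }
  // contentOffset marks where the record's content block starts.
  return { version, fields, contentOffset: headerEnd + 4 }
}

// Trimmed-down sample; a real record carries more fields (e.g. WARC-Record-ID).
const sample = [
  'WARC/1.0',
  'WARC-Type: warcinfo',
  'WARC-Date: 2019-01-01T00:00:00Z',
  'Content-Length: 5',
  '',
  'hello'
].join('\r\n')

const parsed = parseWarcHeader(sample)
console.log(parsed.version, parsed.fields['WARC-Type'])
```

In practice the parsers above do this work for you, including records that span stream chunks, which this sketch does not attempt.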
Using Stream Transform
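    const fs = require('fs')
    const zlib = require('zlib')
    const { WARCStreamTransform } = require('node-warc')

    fs
      .createReadStream('<path-to-gzipd-warcfile>')
      .pipe(zlib.createGunzip())
      .pipe(new WARCStreamTransform())
      .on('data', record => {
        console.log(record)
      })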
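Both .warc and .warc.gz
    const { AutoWARCParser } = require('node-warc')

    const parser = new AutoWARCParser('<path-to-warcfile>')
    parser.on('record', record => { console.log(record) })
    parser.on('done', () => { console.log('finished') })
    parser.on('error', error => { console.error(error) })
    parser.start()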
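If you need to decide yourself whether a file is gzipped (for example, to choose between the two dedicated parsers below), one common approach is to check for the gzip magic bytes at the start of the data. The sketch below uses only Node's zlib module; looksGzipped is a hypothetical helper, and this is an assumption about one possible approach, not a description of how AutoWARCParser makes its choice.

```javascript
const zlib = require('zlib')

// Hypothetical helper, NOT node-warc API: gzip data begins with 0x1f 0x8b.
function looksGzipped (buf) {
  return buf.length >= 2 && buf[0] === 0x1f && buf[1] === 0x8b
}

const plain = Buffer.from('WARC/1.0\r\n')
const gzipped = zlib.gzipSync(plain)

console.log(looksGzipped(plain), looksGzipped(gzipped))
```

For a file on disk, reading the first two bytes is enough to apply the same check.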
Only gzip'd warc files
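    const { WARCGzParser } = require('node-warc')

    const parser = new WARCGzParser('<path-to-gzipd-warcfile>')
    parser.on('record', record => { console.log(record) })
    parser.on('done', () => { console.log('finished') })
    parser.on('error', error => { console.error(error) })
    parser.start()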
Only non gzip'd warc files
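    const { WARCParser } = require('node-warc')

    const parser = new WARCParser('<path-to-warcfile>')
    parser.on('record', record => { console.log(record) })
    parser.on('done', () => { console.log('finished') })
    parser.on('error', error => { console.error(error) })
    parser.start()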
WARC Creation
Environment
NODEWARC_WRITE_GZIPPED - enable writing gzipped records to WARC outputs.
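For context on what "gzipped records" usually means in the WARC world: the convention is to compress each record as its own gzip member and concatenate the members, so that any single record can be decompressed without reading the rest of the file. The sketch below illustrates that layout with Node's zlib only. The two sample records are made up, and whether node-warc produces exactly this layout when the variable is set is an assumption here.

```javascript
const zlib = require('zlib')

// Two made-up, trimmed-down records; each ends with the usual two CRLFs.
const records = [
  'WARC/1.0\r\nWARC-Type: warcinfo\r\nContent-Length: 0\r\n\r\n\r\n\r\n',
  'WARC/1.0\r\nWARC-Type: resource\r\nContent-Length: 2\r\n\r\nhi\r\n\r\n'
]

// Compress each record as its own gzip member, then concatenate the members.
const members = records.map(r => zlib.gzipSync(Buffer.from(r)))
const warcGz = Buffer.concat(members)

// A single member can be decompressed on its own...
const first = zlib.gunzipSync(members[0]).toString()
// ...and decompressing the whole buffer yields every record in order.
const all = zlib.gunzipSync(warcGz).toString()

console.log(first === records[0], all === records.join(''))
```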
Examples
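Using chrome-remote-interface
    const CRI = require('chrome-remote-interface')
    const { RemoteChromeWARCGenerator, RemoteChromeCapturer } = require('node-warc')

    ;(async () => {
      const client = await CRI()
      await Promise.all([client.Page.enable(), client.Network.enable()])
      const cap = new RemoteChromeCapturer(client.Network)
      cap.startCapturing()
      await client.Page.navigate({ url: 'http://example.com' });
      // actual code should wait for a better stopping condition, eg. network idle
      await client.Page.loadEventFired()
      const warcGen = new RemoteChromeWARCGenerator()
      await warcGen.generateWARC(cap, client.Network, {
        warcOpts: {
          warcPath: 'myWARC.warc'
        },
        winfo: {
          description: 'I created a warc!',
          isPartOf: 'My awesome pywb collection'
        }
      })
      await client.close()
    })()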
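The comment in the example above suggests waiting for network idle rather than only the load event. A minimal sketch of one way to do that is to count in-flight requests and wait until none have been active for a quiet period. InflightTracker and its method names are hypothetical; it is not part of node-warc or chrome-remote-interface, and the default timings are arbitrary.

```javascript
// Hypothetical helper: tracks in-flight requests and reports network idle.
class InflightTracker {
  constructor () {
    this.inflight = new Set()
    this.lastActivity = Date.now()
  }

  requestStarted (id) {
    this.inflight.add(id)
    this.lastActivity = Date.now()
  }

  requestFinished (id) {
    this.inflight.delete(id)
    this.lastActivity = Date.now()
  }

  get idle () {
    return this.inflight.size === 0
  }

  // Resolves true once nothing has been in flight for quietMs,
  // or false if timeoutMs elapses first.
  waitForIdle (quietMs = 500, timeoutMs = 30000, pollMs = 50) {
    const started = Date.now()
    return new Promise(resolve => {
      const timer = setInterval(() => {
        const now = Date.now()
        const quiet = this.idle && now - this.lastActivity >= quietMs
        if (quiet || now - started >= timeoutMs) {
          clearInterval(timer)
          resolve(quiet)
        }
      }, pollMs)
    })
  }
}

const tracker = new InflightTracker()
tracker.requestStarted('req-1')
console.log(tracker.idle) // a request is still in flight
tracker.requestFinished('req-1')
console.log(tracker.idle)
```

With chrome-remote-interface, one would presumably call requestStarted from a Network.requestWillBeSent handler and requestFinished from Network.loadingFinished and Network.loadingFailed handlers, keyed by requestId, then await tracker.waitForIdle() before generating the WARC.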
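Using chrome-remote-interface-extra
    const { CRIExtra, Events, Page } = require('chrome-remote-interface-extra')
    const { CRIExtraWARCGenerator, CRIExtraCapturer } = require('node-warc')

    ;(async () => {
      let client
      try {
        // connect to endpoint
        client = await CRIExtra({ host: 'localhost', port: 9222 })
        const page = await Page.create(client)
        const cap = new CRIExtraCapturer(page, Events.Page.Request)
        cap.startCapturing()
        await page.goto('https://example.com', { waitUntil: 'networkIdle' })
        const warcGen = new CRIExtraWARCGenerator()
        await warcGen.generateWARC(cap, {
          warcOpts: {
            warcPath: 'myWARC.warc'
          },
          winfo: {
            description: 'I created a warc!',
            isPartOf: 'My awesome pywb collection'
          }
        })
      } catch (err) {
        console.error(err)
      } finally {
        if (client) {
          await client.close()
        }
      }
    })()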
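Using Puppeteer
    const puppeteer = require('puppeteer')
    // the location of Events may differ between Puppeteer versions
    const { Events } = require('puppeteer/lib/Events')
    const { PuppeteerWARCGenerator, PuppeteerCapturer } = require('node-warc')

    ;(async () => {
      const browser = await puppeteer.launch()
      const page = await browser.newPage()
      const cap = new PuppeteerCapturer(page, Events.Page.Request)
      cap.startCapturing()
      await page.goto('http://example.com', { waitUntil: 'networkidle0' })
      const warcGen = new PuppeteerWARCGenerator()
      await warcGen.generateWARC(cap, {
        warcOpts: {
          warcPath: 'myWARC.warc'
        },
        winfo: {
          description: 'I created a warc!',
          isPartOf: 'My awesome pywb collection'
        }
      })
      await page.close()
      await browser.close()
    })()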
Note
The generateWARC method used in the preceding examples is a helper function that simplifies the WARC generation process. See its implementation for a full example of WARC generation using node-warc.
Or see one of the crawler implementations provided by Squidwarc.