web-tree-crawler

3.1.0 • Public • Published

A naive web crawler that builds a tree of URLs under a domain using web-tree.

Note: This software is intended for personal learning and testing purposes.

How it works

You pass web-tree-crawler a URL, and it tries to discover and visit as many URLs under that domain as it can within a time limit. When time's up, or it has run out of URLs to visit, web-tree-crawler spits out a tree of the URLs it visited. There are several configuration options; see the usage sections below.
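The loop described above can be sketched roughly like this. This is a hypothetical illustration of the idea (breadth-first crawling with a concurrency batch and a deadline), not the package's actual source; `fetchPage` stands in for an HTTP request that returns the links found on a page.

```javascript
// Sketch of the crawl loop: visit URLs in batches until the queue
// is empty or the time limit is reached.
async function crawlSketch (startUrl, fetchPage, { numRequests = 200, timeLimit = 120 } = {}) {
  const deadline = Date.now() + timeLimit * 1000
  const visited = new Set()
  const queue = [startUrl]

  while (queue.length && Date.now() < deadline) {
    // Take up to `numRequests` unvisited URLs and fetch them concurrently
    const batch = queue.splice(0, numRequests).filter(url => !visited.has(url))
    batch.forEach(url => visited.add(url))
    const results = await Promise.all(batch.map(fetchPage))

    // Enqueue newly discovered URLs for the next round
    for (const links of results) {
      queue.push(...links.filter(url => !visited.has(url)))
    }
  }

  return visited
}
```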

Install

npm i web-tree-crawler

CLI

Usage

Usage: [option=] web-tree-crawler <url>

Options:
  format     , f  The output format of the tree (default="string")
  headers    , h  File containing headers to send with each request
  numRequests, n  The number of requests to send at a time (default=200)
  outFile    , o  Write the tree to file instead of stdout
  pathList   , p  File containing paths to initially crawl
  timeLimit  , t  The max number of seconds to run (default=120)
  verbose    , v  Log info and progress to stdout
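As the examples show, options are passed as environment-style assignments before the command (e.g. `t=60 web-tree-crawler <url>`). A rough sketch of how the short and long names in the table could resolve to a single options object — a hypothetical helper for illustration, not the package's code:

```javascript
// Map each short option name to its long form, per the table above
const OPTION_NAMES = {
  f: 'format',
  h: 'headers',
  n: 'numRequests',
  o: 'outFile',
  p: 'pathList',
  t: 'timeLimit',
  v: 'verbose'
}

// Collect whichever form appears in the environment into one object
function optsFromEnv (env) {
  const opts = {}
  for (const [short, long] of Object.entries(OPTION_NAMES)) {
    const value = env[short] ?? env[long]
    if (value !== undefined) opts[long] = value
  }
  return opts
}
```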

Examples

Crawl and print tree to stdout

$ h=/path/to/file web-tree-crawler <url>

.com
  .domain
    .subdomain1
      /foo
        /bar
      .subdomain-of-subdomain1
        /baz
          ?q=1
    .subdomain2
...

And to print an HTML tree...

$ f=html web-tree-crawler <url>

...

Crawl and write tree to file

$ o=/path/to/file web-tree-crawler <url>

Wrote tree to file!

Crawl with verbose logging

$ v=true web-tree-crawler <url>

Visited "<url>"
Visited "<another-url>"
...
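The string tree above nests reversed host labels, then path segments, then query strings. A sketch of how a single URL might map to such a path — this mirrors the printed output only, and is an assumption for illustration rather than web-tree's internal representation:

```javascript
// Turn a URL into the sequence of tree keys shown in the string output:
// reversed hostname labels, then path segments, then the query string.
function treePath (urlString) {
  const { hostname, pathname, search } = new URL(urlString)
  const labels = hostname.split('.').reverse().map(label => '.' + label)
  const segments = pathname.split('/').filter(Boolean).map(seg => '/' + seg)
  return search ? [...labels, ...segments, search] : [...labels, ...segments]
}
```

For example, `treePath('https://sub.example.com/foo/bar?q=1')` yields `['.com', '.example', '.sub', '/foo', '/bar', '?q=1']`, matching the nesting order in the tree above.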

JS

Usage

/**
 * This is the main exported function that crawls and resolves the URL tree.
 *
 * @param  {String}   url 
 * @param  {Object}   [opts = {}] 
 * @param  {Object}   [opts.headers]           - headers to send with each request
 * @param  {Number}   [opts.numRequests = 200] - the number of requests to send at a time
 * @param  {String[]} [opts.startPaths]        - paths to initially crawl
 * @param  {Number}   [opts.timeLimit = 120]   - the max number of seconds to run for
 * @param  {Boolean}  [opts.verbose]           - if true, logs info and progress to stdout
 * @param  {}         [opts....]               - additional options for #lib.request()
 *
 * @return {Promise} 
 */

Example

'use strict'
 
const crawl = require('web-tree-crawler')
 
crawl(url, opts)
  .then(tree => { ... })
  .catch(err => { ... })

Test

npm test

Lint

npm run lint

Documentation

npm run doc

Generate the docs and open them in your browser.

Contributing

Please do!

If you find a bug, want a feature added, or just have a question, feel free to open an issue. You're also welcome to create a pull request addressing an issue. Push your changes to a feature branch and request a merge into develop.

Make sure linting and tests pass and coverage is 💯 before creating a pull request!

License

MIT

Collaborators

  • d4l3c00p3r
  • zbo14