web-tree-crawler
A naive web crawler that builds a tree of URLs under a domain using web-tree.
Note: This software is intended for personal learning and testing purposes.
How it works
You pass web-tree-crawler a URL and it tries to discover/visit as many URLs under that domain name as it can within a time limit. When time's up or it's run out of URLs, web-tree-crawler spits out a tree of URLs it visited. There are several configuration options - see the usage sections below.
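To make the tree idea concrete, here is a minimal sketch of how a list of visited URLs could be folded into a tree like the one in the CLI examples: hostname labels in reverse order, then path segments, then the query string. This is an illustration only, not the actual implementation in web-tree or web-tree-crawler. The helper names `toSegments`, `buildTree`, and `render` are hypothetical, and the two-space indentation is an assumption.

```javascript
'use strict'

// Hypothetical helper: split a URL into tree segments, TLD first,
// e.g. "https://sub.domain.com/baz?q=1" -> ['.com', '.domain', '.sub', '/baz', '?q=1']
function toSegments (rawUrl) {
  const { hostname, pathname, search } = new URL(rawUrl)
  const labels = hostname.split('.').reverse().map(label => '.' + label)
  const paths = pathname.split('/').filter(Boolean).map(part => '/' + part)
  return search ? [...labels, ...paths, search] : [...labels, ...paths]
}

// Hypothetical helper: insert each URL's segments into a nested Map,
// so URLs with a common prefix share the same branch.
function buildTree (urls) {
  const root = new Map()
  for (const url of urls) {
    let node = root
    for (const segment of toSegments(url)) {
      if (!node.has(segment)) node.set(segment, new Map())
      node = node.get(segment)
    }
  }
  return root
}

// Hypothetical helper: render the nested Map as indented lines
// (two spaces per level is an assumption made for this sketch).
function render (node, depth = 0) {
  let out = ''
  for (const [segment, child] of node) {
    out += '  '.repeat(depth) + segment + '\n' + render(child, depth + 1)
  }
  return out
}
```

For example, `render(buildTree(['https://sub.domain.com/foo/bar', 'https://sub.domain.com/baz?q=1']))` yields one `.com` root with `.domain` and `.sub` nested beneath it, and `/foo` and `/baz` as sibling branches under `.sub`.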
Install
npm i web-tree-crawler
CLI
Usage
Usage: [option=] web-tree-crawler <url>
Options:
format     , f  The output format of the tree (default="string")
headers    , h  File containing headers to send with each request
numRequests, n  The number of requests to send at a time (default=200)
outFile    , o  Write the tree to file instead of stdout
pathList   , p  File containing paths to initially crawl
timeLimit  , t  The max number of seconds to run (default=120)
verbose    , v  Log info and progress to stdout
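The numRequests and timeLimit options suggest a loop that sends requests in batches until a deadline passes or the queue empties. The following is a rough sketch of that idea under stated assumptions, not the crawler's real code: `crawlWithLimits` and `isUnderDomain` are hypothetical names, and `visit` is an injected async function (also hypothetical) that takes a URL and resolves to the URLs discovered on that page.

```javascript
'use strict'

// Hypothetical helper: true if `candidate` is on `host` or one of its subdomains.
function isUnderDomain (candidate, host) {
  try {
    const h = new URL(candidate).hostname
    return h === host || h.endsWith('.' + host)
  } catch (err) {
    return false // not a parseable absolute URL
  }
}

// Hypothetical sketch of a batched, time-limited crawl.
// Defaults mirror the documented CLI defaults (200 requests, 120 seconds).
async function crawlWithLimits (startUrl, visit, { numRequests = 200, timeLimit = 120 } = {}) {
  const deadline = Date.now() + timeLimit * 1000
  const host = new URL(startUrl).hostname
  const queue = [startUrl]
  const seen = new Set(queue)
  const visited = []

  // Stop when time is up or there are no URLs left to try.
  while (queue.length > 0 && Date.now() < deadline) {
    // Send at most `numRequests` requests at a time.
    const batch = queue.splice(0, numRequests)
    const results = await Promise.allSettled(batch.map(url => visit(url)))

    results.forEach((result, i) => {
      if (result.status !== 'fulfilled') return // skip failed requests
      visited.push(batch[i])
      for (const found of result.value) {
        if (seen.has(found)) continue
        seen.add(found)
        if (isUnderDomain(found, host)) queue.push(found)
      }
    })
  }
  return visited
}
```

In this sketch the deadline is only checked between batches, so a slow batch can overrun the limit; a real crawler would likely also apply per-request timeouts.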
Examples
Crawl and print tree to stdout
$ h=/path/to/file web-tree-crawler <url>
.com
.domain
.subdomain1
/foo
/bar
.subdomain-of-subdomain1
/baz
?q=1
.subdomain2
...
And to print an HTML tree...
$ f=html web-tree-crawler <url>
...
Crawl and write tree to file
$ o=/path/to/file web-tree-crawler <url>
Wrote tree to file!
Crawl with verbose logging
$ v=true web-tree-crawler <url>
Visited "<url>"
Visited "<another-url>"
...
JS
Usage
/**
 * This is the main exported function that crawls and resolves the URL tree.
 *
 * @param
 * @param
 * @param
 * @param
 * @param
 * @param
 * @param
 * @param
 *
 * @return
 */
Example
'use strict' const crawl =
Test
npm test
Lint
npm run lint
Documentation
npm run doc
Generate the docs and open in browser.
Contributing
Please do!
If you find a bug, want a feature added, or just have a question, feel free to open an issue. In addition, you're welcome to create a pull request addressing an issue. You should push your changes to a feature branch and request merge to develop.
Make sure linting and tests pass and coverage is 💯 before creating a pull request!