Hub Crawl
Hub Crawl finds broken links in Github repositories. It finds links in the
readme
portions of the repos (or the wiki-content
section for wiki pages),
scrapes the links of those sections, and continues the crawl beginning with
those newfound links. Additionally, the requests are made in parallel to ensure
a speedy crawl. It essentially performs a concurrent, breadth-first graph
traversal and logs broken links as it goes.
Installation
Global Use
To begin using Hub Crawl, install it globally with npm
.
npm install -g hub-crawl
Or, if you use yarn
:
yarn global add hub-crawl
Use in Projects
To add Hub Crawl to your project, install it with npm
:
npm install hub-crawl
Or, if you use yarn
:
yarn add hub-crawl
Terminology
Regardless of how you choose to implement Hub Crawl, the following are important terms:
entry
The entry
is the url that is first visited and scraped for links.
scope
The scope
is a url that defines the limit of link scraping. For example, let's
assume the scope
is set to https://github.com/louisscruz/hub-crawl
. If
https://github.com/louisscruz/hub-crawl/other
is in the queue, it will be both
visited and scraped. However, if https://google.com
is in the queue, it will
be visited, but not scraped because the url does not begin with the scope url.
This keeps Hub Crawl from scouring the entire internet. If you do not provide a
scope, Hub Crawl defaults to using the entry
that was provided.
workers
The number of workers determines the maximum number of parallel requests to be open at any given time. The optimal number of workers depends on your hardware and internet speed.
Usage
There are two ways to use Hub Crawl. For common usage, it is likely preferable to use the command line. If you are integrating Hub Crawl into a bigger project, it is probably worth importing or requiring the Hub Crawl class.
Command Line
After Hub Crawl is installed globally, you can run hub-crawl
in the command
line. It accepts the following arguments and options in the following format:
hub-crawl [entry] [scope] -l -w 12
Arguments
[entry]
If not provided, the program will prompt you for this.
[scope]
If not provided, the program will prompt you for this.
Options
-l
If this option is provided, then an initial log in window will appear so that the crawl is authenticated while running. This is useful for private repos.
-w
If this option is provided, it will set the maximum number of workers. For
instance, -w 24
would set a maximum of 24 workers.
-V
This option shows the current version of hub-crawl
.
Importing
If you would like to use Hub Crawl in a project, feel free to import it as such:
;
Or, if you're still not using import
:
var HubCrawl = ;
Create an instance:
const crawler = 12 'https://google.com'
HubCrawl
takes the following as arguments at instantiation:
;
The methods available on HubCrawl
instances can be seen
here. However, the most important methods follow below.
traverseAndLogOutput(login)
This method performs the traversal and logs the broken links. The login
argument determines whether or not an initial window will pop up to log in.
traverse(login)
This method performs the traversal. The login
argument determines whether or
not an initial window will pop up to log in. The traverse
method returns the
broken links. Note that the workers are left alive afterwards.
False Positives
As it currently stands, the crawler only makes a single, breadth-first, concurrent graph traversal. If it happens to be the case that a server was only temporarily down during the traversal, it will still count as a broken link. In the future, a second check will be made on each of the broken links to ensure that they are indeed broken.
Future Improvements
- Set the scope through user input, rather than defaulting to the entry
- Run the queries in parallel, rather than synchronously
- Make into NPM package
- Allow for CLI usage
- Also allow for fallback prompts
- Perform a second check on all broken links to minimize false positives
- Make the output look better
- Allow for the crawler to be easily distributed