node package manager

phantom-sitemap

Crawls a site, extracts the links and returns the promise of either a sitemap or just a list of links.

Crawls a site, extracts the links and returns the promise of either a sitemap or just a list of links.

If a url has a hashbang (#!) or the page contains the fragment meta tag, the html to parse will be created by calling on phantomjs.

var defaultOptions = { maxDepth: 1,
					   maxFollow: 0,
					   verbose: false,
					   silent: false,
					   //timeout for a request:
					   timeout: 60000,
					   //interval before trying again:
					   retryTimeout: 10000,
					   retries:3,
					   ignore: ['xls', 'png', 'jpg', 'png','js', 'css' ], 
					   include: ['pdf', 'doc'], //include other crawlable assets to list
					   cacheDir: './cache',
					   sitemap: true,
					   out: 'sitemap.xml',
					   replaceHost: 'www.example.com'
					 };

Set options.sitemap to false to return just a list of links.

// Test
var crawl = module.exports(options);
crawl('http://localhost:9000').when(
	function(data) {
		console.log('RESULT:\n', data);
	}
	,function(err) {
		console.log('ERROR', err);
	}
)

Using node-crawler to crawl static pages.

TODO: create html map