Easily build flexible, scalable, and distributed web crawlers.
This project will soon be superseded by node-web-crawler.
Easily build flexible, scalable, and distributed web crawlers for node.
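    var flexible = require('flexible');

    // Initiate a crawler. Chainable.
    // The argument is the location to start crawling from.
    var crawler = flexible('')
        // Use the PostgreSQL based queue instead of the default queue.
        .use(flexible.pgQueue('postgres://postgres:1234@localhost:5432/'))

        .route('*/search?q=', function (req) {
            console.log('Search results handled for query:', req.params.q);
        })
        .route('*/users/:name', function (req) {
            // Queue a further location built from the matched user name.
            crawler.navigate('' + req.params.name);
        })
        .route('*', function () {
            console.log('Every other document is handled by this route.');
        })

        .on('complete', function () {
            console.log('All of the queued locations have been crawled.');
        })
        .on('error', function (error) {
            console.error('Error:', error.message);
        });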
- An asynchronous-friendly, evented API for building flexible, scalable, and distributed web crawlers.
- An array-based queue for small crawls, and a PostgreSQL-based queue for massive, efficient crawls.
- Uses a fast, lightweight, and forgiving HTML parser, so documents with imperfect markup can still be crawled.
- Component system: use different queues, a router (wildcards, placeholders, etc.), and other components.
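
To give a feel for what wildcard and placeholder routing involves, here is a minimal, self-contained sketch of a pattern matcher. It is not flexible's router: `compileRoute` and `escapeRegExp` are hypothetical helper names, and the matching rules (`*` matches any run of characters, `:name` captures one path segment) are assumptions made for illustration. Query-string patterns are not covered.

```javascript
// Illustrative only: this is not flexible's actual router implementation.

// Escape characters that are special in a regular expression.
function escapeRegExp(text) {
    return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Compile a pattern such as '*/users/:name' into a matcher function.
// Assumed rules: '*' matches any run of characters, and ':name'
// captures a single path segment under the key 'name'.
function compileRoute(pattern) {
    var names = [];
    var source = pattern.split(/(\*|:[A-Za-z_]\w*)/).map(function (part) {
        if (part === '*') { return '.*'; }
        if (/^:[A-Za-z_]\w*$/.test(part)) {
            names.push(part.slice(1));
            return '([^/?#]+)';
        }
        return escapeRegExp(part);
    }).join('');
    var regex = new RegExp('^' + source + '$');

    // Returns an object of captured parameters, or null when the URL does not match.
    return function match(url) {
        var result = regex.exec(url);
        if (!result) { return null; }
        var params = {};
        names.forEach(function (name, index) {
            params[name] = decodeURIComponent(result[index + 1]);
        });
        return params;
    };
}

var matchUser = compileRoute('*/users/:name');
var userParams = matchUser('http://www.example.com/users/alice');
// userParams is { name: 'alice' }
```

A catch-all pattern such as `'*'` compiles to a matcher that accepts every URL and returns an empty parameter object, which is why it makes sense to register it after more specific routes.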
    npm install flexible

Or from source:

    git clone git://github.com/eckardto/flexible.git
    cd flexible
    npm link
    flexible

    Crawl the web using Flexible for node.

    Usage: node [...]/flexible.bin.js

    Options:
      --url, --uri                   URL of web page to begin crawling on.  [string] [required]
      --domains, -d                  List of domains to allow crawling of.  [string]
      --interval, -i                 Request interval of each crawler.
      --encoding, -e                 Encoding of response body for decoding.  [string]
      --max-concurrency, -m          Maximum concurrency of each crawler.
      --max-crawl-queue-length, -M   Maximum length of the crawl queue.
      --user-agent, -A               User-agent to identify each crawler as.  [string]
      --timeout, -t                  Maximum seconds a request can take.
      --follow-redirect              Follow HTTP redirection responses.  [boolean]
      --max-redirects                Maximum amount of redirects.
      --proxy, -p                    An HTTP proxy to use for requests.  [string]
      --controls, -c                 Enable pause (ctrl-p), resume (ctrl-r), and abort (ctrl-a).  [boolean] [default: true]
      --pg-uri, --pg-url             PostgreSQL URI to connect to for queue.  [string]
      --pg-get-interval              PostgreSQL queue get request interval.
      --pg-max-get-attempts          PostgresSQL queue max get attempts.
Returns a configured crawler instance, optionally already navigated to a location and/or with crawling already started.
Returns a new Crawler object.
Configure the crawler to use a component.
Process a location, and have the crawler navigate (queue) to it.
Have the crawler crawl recursively.
Have the crawler pause crawling.
Have the crawler resume crawling.
Have the crawler abort crawling.
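
The navigate, crawl, pause, resume, and abort behaviour described above can be pictured with a toy, synchronous crawler over an in-memory link graph. This is only a sketch under stated assumptions: `SketchCrawler` and `fetchLinks` are hypothetical names, the method names are chosen to mirror the descriptions (only `navigate` appears in the example earlier in this README), and the real crawler is asynchronous and fetches documents over HTTP.

```javascript
// Illustrative only: a toy crawler, not flexible's implementation.
// fetchLinks is a hypothetical function: location -> array of linked locations.
function SketchCrawler(fetchLinks) {
    this.fetchLinks = fetchLinks;
    this.queue = [];                 // locations waiting to be crawled
    this.seen = Object.create(null); // locations that were already queued
    this.crawled = [];               // locations crawled so far, in order
    this.paused = false;
}

// Queue a location unless it has been queued before.
SketchCrawler.prototype.navigate = function (location) {
    if (this.seen[location]) { return false; }
    this.seen[location] = true;
    this.queue.push(location);
    return true;
};

// Crawl queued locations and queue every discovered link (a recursive
// crawl), until the queue is empty or the crawler has been paused.
SketchCrawler.prototype.crawl = function () {
    while (!this.paused && this.queue.length > 0) {
        var location = this.queue.shift();
        this.crawled.push(location);
        this.fetchLinks(location).forEach(function (link) {
            this.navigate(link);
        }, this);
    }
    return this;
};

// Stop taking locations from the queue; queued locations are kept.
SketchCrawler.prototype.pause = function () {
    this.paused = true;
    return this;
};

// Continue crawling where the crawler left off.
SketchCrawler.prototype.resume = function () {
    this.paused = false;
    return this.crawl();
};

// Stop crawling and discard everything that is still queued.
SketchCrawler.prototype.abort = function () {
    this.paused = true;
    this.queue = [];
    return this;
};

// A made-up link graph standing in for real pages.
var links = { a: ['b', 'c'], b: ['c', 'd'], c: [], d: ['a'] };
var sketch = new SketchCrawler(function (location) {
    if (location === 'b') { sketch.pause(); } // pause while handling 'b'
    return links[location] || [];
});

sketch.navigate('a');
sketch.crawl();                           // crawls 'a' and 'b', then stops
var crawledBeforeResume = sketch.crawled.slice();
sketch.resume();                          // crawls the remaining 'c' and 'd'
```

The `seen` set is what keeps the recursive crawl from looping: location `d` links back to `a`, but `a` is never queued a second time.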
navigated(url) Emitted when a location has been successfully navigated (queued) to.
document(doc) Emitted when a document is finished being processed by the crawler.
paused Emitted when the crawler has paused crawling.
resumed Emitted when the crawler has resumed crawling.
complete Emitted when all locations that were navigated to (queued) have been crawled.
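
As a rough picture of when these events fire, the sketch below builds a tiny evented queue on Node's own `EventEmitter`. It does not use flexible itself: `EventedQueue` and its `drain` method are hypothetical, and the `{ url: ... }` object merely stands in for a parsed document. Only the event names are taken from the list above.

```javascript
// Illustrative only: mimics the event names above with Node's EventEmitter.
var EventEmitter = require('events').EventEmitter;
var util = require('util');

function EventedQueue() {
    EventEmitter.call(this);
    this.pending = [];   // queued URLs
    this.paused = false;
}
util.inherits(EventedQueue, EventEmitter);

// Queue a URL and announce it.
EventedQueue.prototype.navigate = function (url) {
    this.pending.push(url);
    this.emit('navigated', url);
};

EventedQueue.prototype.pause = function () {
    this.paused = true;
    this.emit('paused');
};

EventedQueue.prototype.resume = function () {
    this.paused = false;
    this.emit('resumed');
    this.drain();
};

// Process queued URLs: 'document' is emitted once per URL, and
// 'complete' once the queue has been emptied without being paused.
EventedQueue.prototype.drain = function () {
    while (!this.paused && this.pending.length > 0) {
        var url = this.pending.shift();
        this.emit('document', { url: url }); // stand-in for a parsed document
    }
    if (!this.paused) { this.emit('complete'); }
};

var eventLog = [];
var evented = new EventedQueue();
evented.on('navigated', function (url) { eventLog.push('navigated ' + url); });
evented.on('document', function (doc) { eventLog.push('document ' + doc.url); });
evented.on('paused', function () { eventLog.push('paused'); });
evented.on('resumed', function () { eventLog.push('resumed'); });
evented.on('complete', function () { eventLog.push('complete'); });

evented.navigate('http://www.example.com/');
evented.pause();
evented.resume();
// eventLog now holds, in order: 'navigated http://www.example.com/',
// 'paused', 'resumed', 'document http://www.example.com/', 'complete'
```

Listeners are registered before any work starts, so no event is missed. The same ordering concern applies to the `on('complete', ...)` and `on('error', ...)` handlers in the example at the top of this README.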
This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program. If not, see http://www.gnu.org/licenses/.