Easily build flexible, scalable, and distributed web crawlers for Node.
```javascript
var flexible = require('flexible');

// Initiate a crawler. Chainable.
var crawler = flexible('')
    .use(flexible.pgQueue('postgres://postgres:1234@localhost:5432/'))
    .route('*/search?q=', function (req) {
        console.log('Search results handled for query:', req.params.q);
    })
    .route('*/users/:name', function (req) {
        crawler.navigate('' + req.params.name);
    })
    .route('*', function () {
        console.log('Every other document is handled by this route.');
    })
    .on('complete', function () {
        console.log('All of the queued locations have been crawled.');
    })
    .on('error', function (error) {
        console.error('Error:', error.message);
    });
```
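The routes in the example use wildcard (`*`) and named (`:name`) patterns, with captured values exposed on `req.params`. As a rough, self-contained sketch of how such pattern matching can be implemented (`compileRoute` below is a hypothetical helper written for illustration, not part of flexible's API):

```javascript
// Sketch: compile a route pattern such as '*/users/:name' into a RegExp
// and extract named parameters. compileRoute is a hypothetical helper,
// not part of flexible's API.
function compileRoute(pattern) {
    var keys = [];
    var source = pattern
        .replace(/[.?+^$(){}|[\]\\]/g, '\\$&')  // escape regex metacharacters
        .replace(/\*/g, '.*')                   // '*' matches any run of characters
        .replace(/:(\w+)/g, function (m, key) { // ':name' captures one path segment
            keys.push(key);
            return '([^\\/]+)';
        });
    var regexp = new RegExp('^' + source + '$');

    return function match(url) {
        var m = regexp.exec(url);
        if (!m) { return null; }
        var params = {};
        keys.forEach(function (key, i) { params[key] = m[i + 1]; });
        return params;
    };
}

var match = compileRoute('*/users/:name');
console.log(match('http://example.com/users/alice')); // { name: 'alice' }
console.log(match('http://example.com/search'));      // null
```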
```
npm install flexible
```
Or from source:
```
git clone git://github.com/eckardto/flexible.git
cd flexible
npm link
```
```
flexible

Crawl the web using Flexible for node.
Usage: node [...]/flexible.bin.js

Options:
  --url, --uri                  URL of web page to begin crawling on.  [string] [required]
  --domains, -d                 List of domains to allow crawling of.  [string]
  --interval, -i                Request interval of each crawler.
  --encoding, -e                Encoding of response body for decoding.  [string]
  --max-concurrency, -m         Maximum concurrency of each crawler.
  --max-crawl-queue-length, -M  Maximum length of the crawl queue.
  --user-agent, -A              User-agent to identify each crawler as.  [string]
  --timeout, -t                 Maximum seconds a request can take.
  --follow-redirect             Follow HTTP redirection responses.  [boolean]
  --max-redirects               Maximum amount of redirects.
  --proxy, -p                   An HTTP proxy to use for requests.  [string]
  --controls, -c                Enable pause (ctrl-p), resume (ctrl-r), and abort (ctrl-a).  [boolean] [default: true]
  --pg-uri, --pg-url            PostgreSQL URI to connect to for queue.  [string]
  --pg-get-interval             PostgreSQL queue get request interval.
  --pg-max-get-attempts         PostgreSQL queue max get attempts.
```
- `flexible(options)` Returns a crawler instance, configured, navigated, and/or with crawling already started.
- `new flexible.Crawler()` Returns a new Crawler object.
- `crawler.use(component)` Configure the crawler to use a component.
- `crawler.navigate(url)` Process a location and have the crawler navigate (queue) to it.
- `crawler.crawl()` Have the crawler crawl (recursively).
- `crawler.pause()` Have the crawler pause crawling.
- `crawler.resume()` Have the crawler resume crawling.
- `crawler.abort()` Have the crawler abort crawling.
- `navigated(url)` Emitted when a location has been successfully navigated (queued) to.
- `document(doc)` Emitted when a document has finished being processed by the crawler.
- `paused` Emitted when the crawler has paused crawling.
- `resumed` Emitted when the crawler has resumed crawling.
- `complete` Emitted when all navigated (queued) locations have been crawled.
This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program. If not, see http://www.gnu.org/licenses/.