# node-crawling-framework
Current stage: alpha (work in progress)

"node-crawling-framework" is a crawling & scraping framework for Node.js, heavily inspired by Scrapy.

A Node job server (a kind of Scrapyd equivalent, based on BullJs) is also in the works.
## Features (not fully tested and finalized)
- The core is working: Crawler, Scraper, Spider, item processors (pipeline), DownloadManager, downloader.
- Modular and easily extendable architecture through middlewares and class inheritance:
  - add your own middlewares for spiders, item processors, and downloaders (see the sketch after this list),
  - extend framework spiders and get some features for free.
- DownloadManager: delay and concurrency limit settings.
- RequestDownloader: downloader based on the request package.
- Downloader middlewares:
  - cookie: handles cookie storage between requests,
  - defaultHeaders: adds default headers to each request,
  - retry: retries requests on error,
  - stats: collects some stats during the crawl (request & error counts, ...).
- Spiders:
  - BaseSpider: every spider must inherit from this one,
  - Sitemap: parses a sitemap and feeds the spider with the found URLs,
  - Elasticsearch: feeds the spider URLs from Elasticsearch.
- Spider middlewares:
  - cheerio: cheerio helper on the response to get a cheerio object,
  - scrapeUtils: cheerio + some helpers to facilitate scraping (methods: scrape, scrapeUrl, scrapeRequest, ...),
  - filterDomains: filters out requests to unauthorized domains.
- Item processor middlewares:
  - printConsole: logs items to the console,
  - jsonLineFileExporter: writes scraped items to a JSON file, one line = one JSON object (easier to parse afterwards, smaller memory footprint),
  - logger: logs items to the logger,
  - elasticsearchExporter: exports items to Elasticsearch.
- Logger: configurable logger (default: console).
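As an illustration of the middleware-based extensibility, here is a minimal sketch of a custom item-processor middleware. The middleware contract is not documented yet, so the `processItem(item)` hook and the plain-class shape below are assumptions for illustration only:

```js
// Hypothetical custom item-processor middleware (assumed contract).
// NOTE: the `processItem` hook name and its signature are assumptions;
// check the framework sources for the actual middleware interface.
class LowercaseTagsProcessor {
  processItem(item) {
    // Normalize the `tags` field produced by a spider before export.
    if (Array.isArray(item.tags)) {
      item.tags = item.tags.map(tag => tag.toLowerCase());
    }
    return item; // hand the (possibly modified) item to the next middleware
  }
}

module.exports = LowercaseTagsProcessor;
```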
## Project example
See Quotesbot
## Spider example
```js
const { BaseSpider } = require('node-crawling-framework');

class CssSpider extends BaseSpider {
  constructor() {
    super();
    this.startUrls = ['http://quotes.toscrape.com'];
  }

  // Generator: yields one item per quote, then a request for the next page.
  *parse(response) {
    // `scrape` and `scrapeRequest` are added by the scrapeUtils middleware;
    // the CSS selectors target the quotes.toscrape.com markup.
    const quotes = response.scrape('.quote');
    for (let quote of quotes) {
      yield {
        text: quote.scrape('span.text').text(),
        author: quote.scrape('small.author').text(),
        tags: quote.scrape('a.tag').text()
      };
    }
    yield response.scrapeRequest('li.next > a'); // follow pagination
  }
}

module.exports = CssSpider;
```
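The `scrape` and `scrapeRequest` helpers used above come from the `scrapeUtils` spider middleware (see the middleware list in Features and the configuration below).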
## Crawler configuration example
```js
module.exports = {
  settings: {
    maxDownloadConcurency: 1, // maximum download concurrency, default: 1
    filterDuplicateRequests: true, // filter already scraped requests, default: true
    delay: 100, // delay in ms between requests, default: 0
    maxConcurrentScraping: 500, // maximum concurrent scraping, default: 500
    maxConcurrentItemsProcessingPerResponse: 100, // maximum concurrent item processing per response, default: 100
    autoCloseOnIdle: true, // auto close crawler when crawling is finished, default: true
    logger: null // logger, must implement the console interface, default: console
  },
  spider: {
    type: '', // spider to use for crawling, searched in ${cwd} or ${cwd}/spiders, can also be a class definition object
    options: {}, // spider constructor args
    middlewares: {
      scrapeUtils: {}, // add utils methods to the response, e.g. "response.scrape()"
      filterDomains: {} // avoid unwanted domain requests from being scheduled
    }
  },
  itemProcessor: {
    middlewares: {
      jsonLineFileExporter: {}, // write scraped items to a JSON file, one line = one JSON object (easier to parse afterwards, smaller memory footprint)
      logger: {} // log scraped items through the crawler logger
    }
  },
  downloader: {
    type: 'RequestDownloader', // downloader to use, can also be a class definition object
    options: {}, // downloader constructor args
    middlewares: {
      stats: {}, // give some stats about requests, e.g. number of requests/errors
      retry: {}, // retry failed requests
      cookie: {} // store cookies between requests
    }
  }
};
```
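As the comments above note, `spider.type` (and `downloader.type`) also accepts a class definition instead of a lookup name. A minimal sketch, assuming the spider from the example above is saved at `./spiders/CssSpider.js` (the path is an assumption):

```js
// Pass the spider class itself instead of a name string.
const CssSpider = require('./spiders/CssSpider'); // path assumed

module.exports = {
  spider: {
    type: CssSpider, // class definition object instead of a string
    options: {},
    middlewares: { scrapeUtils: {} }
  }
  // ... rest of the configuration as above
};
```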
## Crawler instantiation example
```js
const { createCrawler } = require('node-crawling-framework');

const config = require('./config'); // the configuration module shown above
const crawler = createCrawler(config);

crawler.crawl(); // start crawling; exact start method name assumed
```
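Since `autoCloseOnIdle` defaults to `true`, the crawler closes itself once crawling is finished; no explicit teardown call is needed after starting the crawl.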
## TODO list
- Add unit tests
- Add documentation
- Add MongoDB feeder/exporter
- Run some benchmarks?
- Finish formRequest scraping (add clickable elements)
- Utils: add date parsing (moment wrapper), datapager helper?
- Add multi-spider support?
- Add the crawling queue to settings / allow overriding the queue (could enable a shared Redis queue for distributed crawling)
- Allow overriding/configuring the DownloadManager (could enable proxy pool handling, for example)
- Puppeteer downloader:
  - make it compatible with the header and cookie middlewares
- Split plugins/middlewares into packages
- Command-line tool, "ncf-cli":
  - scaffolding: create a project (with a wizard), a spider, or any middleware
  - crawl: launch a crawl
  - deploy: deploy to node-job-server
- Find a solution for DNS caching
- Middleware to respect "robots.txt"
- Limit max response size
- Auto-throttle