A Node.js crawler library for building versatile crawlers quickly and easily. It aims to make working with request and cheerio a little easier, so you don't have to write the same standard code over and over again.
Functions
- Play nice with servers: Wait between each request.
- Get the `next` and `last` URLs for pagination scenarios.
- Write the list synchronously to a file at the end.
- Serving header info
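
The waiting and file-writing behaviour listed above can be sketched roughly as follows. This is a minimal illustration, not this library's actual API: `sleep`, `crawlPolitely`, `writeListSync` and the injected `fetchPage` callback are hypothetical names chosen for the example.

```javascript
const fs = require('fs');

// Resolve after `ms` milliseconds.
function sleep(ms) {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

// Visit URLs one at a time, pausing `delayMs` between requests.
// `fetchPage` is a hypothetical async function (url) => result supplied by the caller.
async function crawlPolitely(urls, fetchPage, delayMs) {
  const results = [];
  for (let i = 0; i < urls.length; i++) {
    if (i > 0) await sleep(delayMs); // no wait before the first request
    results.push(await fetchPage(urls[i]));
  }
  return results;
}

// Write the collected list to disk in one synchronous call, one entry per line.
function writeListSync(filePath, list) {
  fs.writeFileSync(filePath, list.join('\n') + '\n', 'utf8');
}
```

Requests run strictly one after another here, so the delay is a lower bound on the time between two hits on the same server. The list is only written once crawling has finished, which keeps the code simple at the cost of losing results if the process crashes midway.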
Examples
- List crawling: Crawl paginated lists for URLs
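
A list crawl of this kind might look like the sketch below. It assumes a hypothetical `getPage` callback that fetches one page and returns its item links together with the `next` and `last` pagination links; the function name `crawlList`, the returned object shape and the `maxPages` guard are assumptions for illustration, not the library's real interface.

```javascript
// Follow `next` links through a paginated list and collect absolute item URLs.
// `getPage` is a hypothetical async function:
//   (url) => ({ items: string[], next: string|null, last: string|null })
async function crawlList(startUrl, getPage, maxPages = 100) {
  const found = [];
  const visited = new Set();
  let url = new URL(startUrl).href; // normalise so comparisons below are reliable

  while (url && !visited.has(url) && visited.size < maxPages) {
    visited.add(url);
    const page = await getPage(url);

    // Links on a page may be relative, so resolve them against the current URL.
    found.push(...page.items.map((href) => new URL(href, url).href));
    const next = page.next ? new URL(page.next, url).href : null;
    const last = page.last ? new URL(page.last, url).href : null;

    // Stop once the current page is the one marked as last.
    url = url === last ? null : next;
  }
  return found;
}
```

The `visited` set and `maxPages` limit guard against pagination loops, for example a site whose `next` link on the final page points back to an earlier page.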
Planned functionality
- Item crawling
- Pagination iteration, second version
- Define which domain(s) to crawl
- Site crawling: Add found URLs to the crawl queue
- Write content asynchronously (append to file) throughout crawling.
- Follow robots.txt
- Check for new content
- Check for updated content
- Override the crawler header and set the `from` field.
- Crawl with headless browser.