node-miningcompany
Note: version 1.0.0 no longer includes goldwasher. Use the string and validator modules to replicate this functionality if needed. See the advanced example.
Miningcompany is a tool for gathering, scraping and mining text and links from websites at defined points in time. For instance, imagine you want to collect all headlines from a news site. Not only that, but you want them collected automatically every hour, on weekdays only. You also want their related links and a collection of metadata about each headline. Miningcompany is built for this kind of purpose and also includes recommended string and validator tools for working with the results.
The project is built on several other modules:
- node-schedule - used to schedule when the scraper should run.
- krawler - the actual scraping is performed by krawler.
- validator - to check/validate strings.
- underscore.string - to clean up strings.
Everything is built around mining terminology. This (hopefully) makes it easier to understand what is going on in the module. As such, the most commonly used and important objects are:
- maps - an array of JSON objects that each define at minimum a url to scrape. Additional parameters can also be passed in here, for instance targets for later use.
- options - options for miningcompany and krawler.
- cart - a collection of results from scraping one of the maps.
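To make the vocabulary concrete, here is a minimal sketch of what maps and options might look like as plain JavaScript values. Only url, targets, schedule, krawler and forceUTF8 come from this README; the findInvalidMaps helper is hypothetical and not part of miningcompany.

```javascript
// maps: an array of objects, each with at least a url.
// Extra keys (such as targets) are carried along for later use.
var maps = [
  { url: 'http://www.cnn.com', targets: 'h3' },
  { url: 'http://www.reddit.com' }
];

// options: a schedule for node-schedule plus optional krawler settings.
var options = {
  schedule: { second: [0, 10, 20, 30, 40, 50] },
  krawler: { forceUTF8: true }
};

// Hypothetical helper (not part of miningcompany): returns the maps
// that lack a usable url string.
function findInvalidMaps(maps) {
  return maps.filter(function (map) {
    return !map || typeof map.url !== 'string' || map.url.length === 0;
  });
}

console.log(findInvalidMaps(maps).length); // 0
```

Checking the maps up front like this is optional, but it catches a missing url before the scheduler ever fires.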
- When you call open() on an instantiated miningcompany, it will start up a scheduler.
- Every time the scheduler reaches a scheduled point in time, it will fire a new trip.
- On every trip, all the maps will be mined and, for each map, a cart of results (and any errors) will be returned.
- Each cart contains results with their respective cheerio DOM, which you can use to pick out whatever you need.
- What you do from here is up to you. For instance, you could store the carts directly in MongoDB for later analysis.
As Miningcompany is an EventEmitter, you can listen for all parts of the cycle and catch the carts. See the example below or run the included example.js to see how it works.
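The trip and cart cycle can be sketched with Node's built-in EventEmitter. The class below is a hypothetical stand-in, not miningcompany itself: the event names 'trip' and 'cart', the runTrip method, the cart fields and the fake fetch function are all assumptions chosen for illustration. A real run would be driven by node-schedule and krawler instead.

```javascript
var EventEmitter = require('events').EventEmitter;
var util = require('util');

// Hypothetical stand-in for the trip/cart cycle (not the real module).
function ToyCompany(maps, fetch) {
  EventEmitter.call(this);
  this.maps = maps;
  this.fetch = fetch; // function (map) -> results; throws on failure
}
util.inherits(ToyCompany, EventEmitter);

// One trip: mine every map and emit one cart per map.
ToyCompany.prototype.runTrip = function () {
  var self = this;
  self.emit('trip', { started: new Date() });
  self.maps.forEach(function (map) {
    var cart = { map: map, results: [], errors: [] };
    try {
      cart.results = self.fetch(map);
    } catch (err) {
      cart.errors.push(err.message);
    }
    self.emit('cart', cart);
  });
};

// Fake fetcher so the sketch runs without network access.
function fakeFetch(map) {
  if (map.url.indexOf('fail') !== -1) {
    throw new Error('could not reach ' + map.url);
  }
  return ['headline from ' + map.url];
}

var carts = [];
var company = new ToyCompany(
  [{ url: 'http://www.reddit.com' },
   { url: 'http://www.sitethatwillobviouslyfail.com' }],
  fakeFetch
);
company.on('cart', function (cart) { carts.push(cart); });
company.runTrip();

console.log(carts.length);           // 2
console.log(carts[1].errors.length); // 1
```

The point of the sketch is that a failing map still produces a cart, so a listener sees one cart per map and can inspect its errors.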
Installation
npm install miningcompany
options
- schedule - a pattern node-schedule will accept. The easiest is to use an object literal as in the example. However, you can also pass in a CRON string if you prefer.
- krawler - an optional object literal with additional options for krawler. By default, forceUTF8 is set to true.
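As an illustration of the two schedule styles, the sketch below writes the "every hour, weekdays only" case from the introduction both as an object literal and as a CRON string. The field names (minute, dayOfWeek) follow node-schedule's recurrence rule conventions. The matchesSchedule helper is hypothetical and deliberately simplified: it treats a missing field as "any value" and handles only numbers and arrays, so it shows how such a literal reads rather than how node-schedule evaluates it.

```javascript
// Every hour on the hour, Monday (1) through Friday (5).
var weekdayHourly = { minute: 0, dayOfWeek: [1, 2, 3, 4, 5] };

// The same schedule as a five-field CRON string.
var weekdayHourlyCron = '0 * * * 1-5';

// Hypothetical helper: does a Date satisfy a simplified object-literal spec?
// A missing field means "any value"; a field may be a number or an array.
function matchesSchedule(spec, date) {
  var actual = {
    second: date.getSeconds(),
    minute: date.getMinutes(),
    hour: date.getHours(),
    dayOfWeek: date.getDay()
  };
  return Object.keys(spec).every(function (field) {
    var allowed = [].concat(spec[field]);
    return allowed.indexOf(actual[field]) !== -1;
  });
}

// 5 January 2015 was a Monday; 4 January 2015 was a Sunday.
console.log(matchesSchedule(weekdayHourly, new Date(2015, 0, 5, 9, 0, 0)));  // true
console.log(matchesSchedule(weekdayHourly, new Date(2015, 0, 4, 9, 0, 0)));  // false
console.log(matchesSchedule(weekdayHourly, new Date(2015, 0, 5, 9, 30, 0))); // false
```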
Simple example
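    var Miningcompany = require('miningcompany');

    // get headlines from frontpage of reddit
    var maps = [
      { url: 'http://www.reddit.com' },
      { url: 'http://www.sitethatwillobviouslyfail.com' }
    ];

    // trip every 10 seconds
    var options = {
      schedule: { second: [0, 10, 20, 30, 40, 50] }
    };

    var company = new Miningcompany(maps, options);

    // attach listeners for the parts of the cycle you care about,
    // then start the scheduler
    company.open();

    // shut down after 35 seconds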
Advanced example (included as example.js)
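    var Miningcompany = require('miningcompany');

    // get headlines from frontpage of cnn
    var maps = [
      { url: 'http://www.cnn.com', targets: 'h3' },
      { url: 'http://www.sitethatwillobviouslyfail.com', targets: 'h1' }
    ];

    // trip every 10 seconds
    var options = {
      schedule: { second: [0, 10, 20, 30, 40, 50] }
    };

    var company = new Miningcompany(maps, options);

    // attach listeners for the parts of the cycle you care about,
    // then start the scheduler
    company.open();

    // shut down after 35 seconds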