Command-line web crawler driven by JSON configuration files that define which data fields to scrape from the visited websites, using regular expressions or DOM selectors, and how to export them as JSON.
$ npm install scraperrr
Several example configurations are provided in the
$ scraperrr -c examples/gvtk131_config.json
This will export a JSON data file with the motions from the wiki.
$ scraperrr -c examples/gvtk131_config.json -o out/gv-anträge.json
The resulting JSON export is saved to the specified output file.
$ scraperrr -v -c examples/gvtk131_config.json
Enables verbose output for debugging.
--verbose works as well.
$ scraperrr -p 500 -c examples/gvtk131_config.json
Sets the politeness, a waiting period in milliseconds between HTTP requests.
--politeness works as well.
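To illustrate what a politeness delay means in practice, here is a minimal sketch of a crawl loop that pauses between consecutive requests. It is not scraperrr's actual implementation; the names `sleep`, `crawlPolitely`, `fetchPage` and `wait` are hypothetical and chosen for this example only.

```javascript
// Illustrative sketch only; not taken from scraperrr's source code.

// Resolve after `ms` milliseconds.
function sleep(ms) {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

// Visit the URLs one at a time, pausing `politenessMs` milliseconds
// between consecutive requests.
// `fetchPage` is a hypothetical async function: (url) => page content.
// `wait` can be swapped out (for example in tests) and defaults to `sleep`.
async function crawlPolitely(urls, politenessMs, fetchPage, wait = sleep) {
  const pages = [];
  for (let i = 0; i < urls.length; i++) {
    if (i > 0) {
      await wait(politenessMs); // no pause before the very first request
    }
    pages.push(await fetchPage(urls[i]));
  }
  return pages;
}
```

With a politeness of 500, as in the command above, a crawl over three pages under this sketch would pause twice for half a second each, once between every pair of requests.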
The config format is still in development and changes occasionally. Once it is frozen, full documentation will be provided.
- Flexible configuration files for scraping websites and exporting results to a specified JSON file
- Configurable waiting period (politeness) between HTTP requests
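The general idea behind the first feature, a configuration that names data fields and the patterns used to extract them, can be sketched as follows. The configuration shape shown here (`fields` mapping a name to a regular expression) is an assumption made purely for illustration and is not scraperrr's real config format, which is still changing; `scrapeFields` and `toJsonExport` are likewise hypothetical helpers.

```javascript
// Hypothetical configuration for illustration; NOT scraperrr's real format.
const exampleConfig = {
  fields: {
    title: "<h1>(.*?)</h1>",
    author: "Author:\\s*(\\w+)",
  },
};

// Apply each field's regular expression to the page and keep the first
// capture group, or null when the pattern does not match.
function scrapeFields(html, config) {
  const record = {};
  for (const [name, pattern] of Object.entries(config.fields)) {
    const match = new RegExp(pattern).exec(html);
    record[name] = match ? match[1] : null;
  }
  return record;
}

// Serialize the scraped records as pretty-printed JSON, ready to be
// written to an output file.
function toJsonExport(records) {
  return JSON.stringify(records, null, 2);
}

// Example run on an inline sample page (no network access involved).
const samplePage = "<h1>Motion 12</h1><p>Author: Alice</p>";
console.log(toJsonExport([scrapeFields(samplePage, exampleConfig)]));
```

Keeping the patterns in data rather than in code is what makes such a crawler reusable across sites: only the configuration changes, while the extraction and export steps stay the same.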