# MrCrowley
Retrieve data from different websites using html elements to gather the information you need.
## Installation

- Install node

```sh
npm install -g mrcrowley
```

Example run:

```sh
mrcrowley --config="/home/user/.crawl.json" --save="/home/user/crawlResults.json"
```
## Usage

### Core usage

I still have to document how you can `require` and use the core directly, but just so you know, you can do it and the results are promise-based.
### CLI

Set up a `.crawl.json` and run all the tasks you want by passing it to `mrcrowley`.

Note: Any path should be absolute or relative to the place the script is called from.

```sh
mrcrowley --config=<config_json_src> --output=<file_to_save_src> --force=<false|true>
```
Notes:

- `<config_json_src>`: Path to the config json for crawling. It is required.
- `<file_to_save_src>`: Path to the file in which to save the results. For now, only `json` is supported. It is required.
- `force`: Forces creation of a new output file. If `false` and the output file already exists, it will just be updated. It defaults to `false`.
## Configuration

Notes:

- `retrieve`: Besides the simplified version, you may also nest it to get contained data.
- `attribute`: If not provided, the text content will be returned. Optional key.
- `ignore`: Ignore results matching a regex pattern. Optional key.
- `enableJs`: Javascript isn't enabled by default for security reasons. Use this only if you really need it.
- `wait`: Usually used with `enableJs`. If the source uses javascript to render, you may `wait` for the selector to be present.
- `<var_to_replace>`: It can also be an object with keys `min` (defaults to `0`) and `max` (defaults to `10`).
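To make the keys above concrete, here is a hypothetical sketch of what a `.crawl.json` could look like. Only the key names listed above (`retrieve`, `attribute`, `ignore`, `enableJs`, `wait`, and the `min`/`max` object) come from this README; the overall layout, the `url`/`selector` field names, and the `{{page}}` placeholder syntax are assumptions for illustration — check the real examples under `src/_test/data` for the actual format.

```json
{
    "url": "https://example.com/list/{{page}}",
    "page": { "min": 0, "max": 10 },
    "enableJs": true,
    "wait": ".content",
    "retrieve": {
        "title": { "selector": "h1" },
        "links": {
            "selector": "a",
            "attribute": "href",
            "ignore": "^#"
        }
    }
}
```

Here `links` nests a `selector` with an `attribute` so the `href` value is returned instead of the text content, and `ignore` drops in-page anchors.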
## Examples

Go under the `src/_test/data` folder and check the `*.json` files.