The scraping machine
Under development - More news soon
This is just the beginning of a long journey
Let's make web scraping fun again!
From a JSON config file, you can generate a web scraping script and see its output.
Concept
- You just need to define your needs in a JSON file, like demo.json
- Then you execute
node index demo.json
in order to start the process in index.js
- First it validates the arguments and data
- Then it decides which language to use. For now, only Python 3+ (Beautiful Soup) and Node.js (X-ray) are supported
- Then it renders all the info into a Handlebars template, like templates/python.hbs or templates/node.hbs
- The script file is generated, like google.py or google.js
- The script is executed by Node as a child process, generating the final output, like google.json
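The first steps above (validating the arguments, picking the language, and choosing the template and file names) can be sketched roughly as follows. This is only an illustration and not the project's actual index.js: the names `parseArgs`, `planFiles` and `TARGETS` are hypothetical, and the sketch assumes that the script name (like google) comes from the config file.

```javascript
// Sketch only: parseArgs, planFiles and TARGETS are hypothetical names,
// not taken from the project's index.js.
const path = require("path");

// Maps the optional CLI language argument to a template and file extension.
const TARGETS = new Map([
  ["python", { template: "templates/python.hbs", extension: ".py" }],
  ["js", { template: "templates/node.hbs", extension: ".js" }],
  ["node", { template: "templates/node.hbs", extension: ".js" }],
]);

// Validates the arguments and decides the language (Python when omitted).
function parseArgs(argv) {
  const [configPath, language = "python"] = argv;
  if (!configPath || path.extname(configPath) !== ".json") {
    throw new Error("Usage: node index.js <config.json> [python|js|node]");
  }
  const target = TARGETS.get(language.toLowerCase());
  if (!target) {
    throw new Error(`Unsupported language: ${language}`);
  }
  return { configPath, ...target };
}

// Assumption: `name` is read from the config file (e.g. "google").
function planFiles(name, target) {
  return {
    script: `${name}${target.extension}`, // generated scraping script
    output: `${name}.json`, // final scraping results
  };
}

// Example: the arguments of `node index.js demo.json js`
const target = parseArgs(["demo.json", "js"]);
console.log(target.template, planFiles("google", target));
```

Using a `Map` instead of a plain object avoids accidentally accepting inherited keys such as `constructor` as a language.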
Demo
Inside demo.json:
Start the machine
- For Python script output:
node index.js demo.json
node index.js demo.json python
- For Node script output:
node index.js demo.json js
node index.js demo.json node
Output
- Node.js file or Python file
- JSON file with the scraping results
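One possible way to get from the generated script to the JSON results file is sketched below. It is not the project's implementation: `runAndSave` is a hypothetical helper, and the sketch assumes that the generated script prints its results as JSON on stdout (the real scripts may write the file themselves).

```javascript
const { execFileSync } = require("child_process");
const fs = require("fs");

// Hypothetical helper: runs a generated script as a child process and
// saves whatever JSON it prints on stdout to `outputPath`.
function runAndSave(command, args, outputPath) {
  // execFileSync throws if the child exits with a non-zero status.
  const stdout = execFileSync(command, args, { encoding: "utf8" });
  // JSON.parse throws if the script printed something that is not JSON.
  const results = JSON.parse(stdout);
  fs.writeFileSync(outputPath, JSON.stringify(results, null, 2));
  return results;
}

// Possible usage, assuming google.py exists and python3 is on the PATH:
// runAndSave("python3", ["google.py"], "google.json");
```

`execFileSync` is used here instead of `exec` so that the file name is passed as an argument and never interpreted by a shell.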
Testing
You can test your changes by running:
npm test
Future Implementations
- Support for Node.js (X-Ray).
- Support for CSS3 Selectors.
- Support for recursive queries.
- Support for "follow links", like a crawler.
- Implementation as a CLI
- Basic Testing
- ESLint Support
- JSDoc Support
- Basic Gulp Tasks
- Example Folder
Achievements
v.0.0.3
Features:
- Added support for JSDoc
- Added Gulp Tasks
- Added Basic Testing with Mocha, Chai and Istanbul
- Added .editorconfig
- Added ESLint support
- Added example folder
- Added support for Node.js
Notes: Main target: Improved Proof of concept
v.0.0.2
Features:
- Roadmap added
- Added file structure
- Defined a minimal JSON structure
- Added minimal validation
- Added a template engine
- Added support for python
- Added dynamic information from the setup config file
Notes: Main target: Proof of concept
v.0.0.1
Features:
Notes: Just a "Hello world"