the-scraping-machine

0.0.3 • Public • Published 8 years ago



The scraping machine

Under development - More news soon


This is just the beginning of a long journey

Let's make web scraping fun again!

From a JSON config file, you can generate a web scraping script and see its output.

Concept

You just need to define your needs in a JSON file, like demo.json
Then you execute node index demo.json to start the process in index.js:
- First, it validates the arguments and data
- Then it decides which language to use. For now, only Python 3+ (Beautiful Soup) and Node.js (X-ray) are supported
- Then it renders all the info into a Handlebars template, like templates/python.hbs or templates/node.hbs
The script file is generated, like google.py or google.js
Node executes the script as a child process, which generates the final output, like google.json
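
The last two steps (write the generated script, then run it as a child process and save its output) can be sketched as follows. This is only an illustration under stated assumptions, not the project's actual code: the `TEMPLATE` string, `render` and `generateAndRun` are hypothetical names, the `{{key}}` replacement is a tiny stand-in for Handlebars, and the generated script does no real scraping.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');
const { execFileSync } = require('child_process');

// Hypothetical stand-in for templates/node.hbs: plain {{key}} placeholders only.
const TEMPLATE = [
  '// generated from config for {{url}}',
  'const result = [{ source: "{{url}}" }];',
  'process.stdout.write(JSON.stringify(result));',
].join('\n');

// Replace each {{key}} with the matching config value (missing keys become '').
function render(template, context) {
  return template.replace(/\{\{(\w+)\}\}/g, (match, key) =>
    context[key] === undefined ? '' : String(context[key]));
}

// Write <file_name>.js, run it as a child process,
// and save whatever it prints as <file_name>.json.
function generateAndRun(config, outDir) {
  const scriptPath = path.join(outDir, `${config.file_name}.js`);
  const outputPath = path.join(outDir, `${config.file_name}.json`);
  fs.writeFileSync(scriptPath, render(TEMPLATE, config));
  const stdout = execFileSync(process.execPath, [scriptPath], { encoding: 'utf8' });
  fs.writeFileSync(outputPath, stdout);
  return { scriptPath, outputPath };
}

// Use a temporary folder so the sketch does not touch the working directory.
const outDir = fs.mkdtempSync(path.join(os.tmpdir(), 'scraping-demo-'));
const files = generateAndRun({ url: 'http://google.es', file_name: 'google' }, outDir);
console.log(fs.readFileSync(files.outputPath, 'utf8'));
```

A Python target would presumably follow the same shape, with a different template and interpreter.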

Demo

Inside demo.json:

{
    "source_type": "url",
    "url": "http://google.es",
    "file_name": "google",
    "data": [
        {
            "name": "web-title",
            "type": "selector",
            "query": "title"
        }, {
            "name": "web2",
            "type": "selector",
            "query": "title"
        }
    ]
}
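
As a rough idea of what validating such a file could look like, here is a minimal sketch. The rules are assumptions inferred from the fields in demo.json, and `validateConfig` is a hypothetical helper rather than the project's own validator.

```javascript
// Returns a list of problems found in a config object (empty list = valid).
// Assumed rules: every field shown in demo.json must be a non-empty string,
// and "data" must be a non-empty array of { name, type, query } entries.
function validateConfig(config) {
  const errors = [];
  const isNonEmptyString = (value) => typeof value === 'string' && value.length > 0;

  for (const key of ['source_type', 'url', 'file_name']) {
    if (!isNonEmptyString(config[key])) {
      errors.push(`${key} must be a non-empty string`);
    }
  }

  if (!Array.isArray(config.data) || config.data.length === 0) {
    errors.push('data must be a non-empty array');
  } else {
    config.data.forEach((entry, index) => {
      for (const key of ['name', 'type', 'query']) {
        if (!entry || !isNonEmptyString(entry[key])) {
          errors.push(`data[${index}].${key} must be a non-empty string`);
        }
      }
    });
  }
  return errors;
}

const demoConfig = {
  source_type: 'url',
  url: 'http://google.es',
  file_name: 'google',
  data: [
    { name: 'web-title', type: 'selector', query: 'title' },
    { name: 'web2', type: 'selector', query: 'title' },
  ],
};

console.log(validateConfig(demoConfig)); // no problems for the demo config
console.log(validateConfig({ url: 'http://google.es' })); // lists the missing fields
```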

Start the machine

For Python script output (either command works):

    node index.js demo.json

    node index.js demo.json python

For Node script output (either command works):

    node index.js demo.json js

    node index.js demo.json node
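
The commands above suggest that the optional last argument selects the target language, with Python used when it is omitted. A minimal sketch of that mapping, assuming this behavior (`resolveLanguage` and `LANGUAGE_ALIASES` are hypothetical names, not taken from index.js):

```javascript
// Assumed aliases, based only on the four commands listed above.
const LANGUAGE_ALIASES = new Map([
  ['python', 'python'],
  ['js', 'node'],
  ['node', 'node'],
]);

// Map the optional CLI argument to a target language.
function resolveLanguage(arg) {
  if (arg === undefined) {
    return 'python'; // no argument: Python output, as in "node index.js demo.json"
  }
  const language = LANGUAGE_ALIASES.get(arg);
  if (language === undefined) {
    throw new Error(`Unsupported language: ${arg}`);
  }
  return language;
}

// process.argv is [node binary, script path, config path, language]
const [configPath, languageArg] = process.argv.slice(2);
console.log(configPath, resolveLanguage(languageArg));
```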

Output

A Node.js or Python script file
A JSON file with the scraping results

[
    {
        "web-title": "Google",
        "web2": "Google"
    }
]
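
To make the link between the config and this output explicit: each entry in "data" contributes one key to the result row, named after its "name" field. The sketch below illustrates that mapping under assumptions; `buildSelectorMap` and `buildResult` are hypothetical helpers, and the `lookup` function stands in for the real scraping done by Beautiful Soup or X-ray.

```javascript
// Turn the "data" entries of a config into a { name: query } map.
function buildSelectorMap(data) {
  const selectors = {};
  for (const entry of data) {
    if (entry.type === 'selector') {
      selectors[entry.name] = entry.query;
    }
  }
  return selectors;
}

// Build the output: one row whose keys are the entry names.
// "lookup" is a stand-in for the scraper: it receives a query, returns a value.
function buildResult(selectors, lookup) {
  const row = {};
  for (const [name, query] of Object.entries(selectors)) {
    row[name] = lookup(query);
  }
  return [row];
}

const data = [
  { name: 'web-title', type: 'selector', query: 'title' },
  { name: 'web2', type: 'selector', query: 'title' },
];

// Fake lookup: pretend the page title is "Google".
const fakeLookup = (query) => (query === 'title' ? 'Google' : null);
const result = buildResult(buildSelectorMap(data), fakeLookup);
console.log(JSON.stringify(result, null, 4));
```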

Testing

You can test your changes with:

    npm test

Future Implementations

Support for Node.js (X-Ray).
Support for CSS3 Selectors.
Support for recursive queries.
Support for "follow links", like a crawler.
Implementation as CLI
Basic Testing
ESLint Support
JSDoc Support
Basic Gulp Tasks
Example Folder

Achievements

v.0.0.3

Features:

Added support for JSDoc
Added Gulp Tasks
Added Basic Testing with Mocha, Chai and Istanbul
Added .editorconfig
Added ESLint support
Added example folder
Added support for Node.js

Notes: Main target: Improved Proof of concept

v.0.0.2

Features:

Roadmap added
Added file structure
Defined a minimal JSON structure
Added minimal validation
Added a template engine
Added support for Python
Added dynamic information from the setup config file

Notes: Main target: Proof of concept

v.0.0.1

Features:

Notes: Just a "Hello world"

Install

npm i the-scraping-machine

Repository

github.com/UlisesGascon/the-scraping-machine

Homepage

github.com/UlisesGascon/the-scraping-machine#readme

Weekly Downloads

0

Version

0.0.3

License

GPL-3.0

Last publish

8 years ago
