
    microcrawler




    Available Official Crawlers

    A list of official, publicly available crawlers.

    Missing something? Feel free to open an issue.

    Prerequisites

    Installation

    From npmjs.org (the easy way)

    This is the easiest way; the prerequisites above still need to be satisfied.

    npm install -g microcrawler
    

    From Sources

    This is useful if you want to tweak the source code, implement a new crawler, etc.

    # Clone repository
    git clone https://github.com/ApolloCrawler/microcrawler.git
    
    # Enter folder
    cd microcrawler
    
    # Install required packages - dependencies
    npm install
    
    # Install from local sources
    npm install -g .
    

    Usage

    Show available commands

    $ microcrawler
    
      Usage: microcrawler [options] [command]
    
    
      Commands:
    
        collector [args]  Run data collector
        config [args]     Run config
        exporter [args]   Run data exporter
        worker [args]     Run crawler worker
        crawl [args]      Crawl specified site
        help [cmd]        display help for [cmd]
    
      Options:
    
        -h, --help     output usage information
        -V, --version  output the version number
    

    Check microcrawler version

    $ microcrawler --version
    0.1.27
    

    Initialize config file

    $ microcrawler config init
    2016-09-03T10:45:13.105Z - info: Creating config file "/Users/tomaskorcak/.microcrawler/config.json"
    {
        "client": "superagent",
        "timeout": 10000,
        "throttler": {
            "enabled": false,
            "active": true,
            "rate": 20,
            "ratePer": 1000,
            "concurrent": 8
        },
        "retry": {
            "count": 2
        },
        "headers": {
            "Accept": "*/*",
            "User-Agent": "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
            "From": "googlebot(at)googlebot.com"
        },
        "proxy": {
            "enabled": false,
            "list": [
                "https://168.63.20.19:8145"
            ]
        },
        "natFaker": {
            "enabled": true,
            "base": "192.168.1.1",
            "bits": 16
        },
        "amqp": {
            "uri": "amqp://localhost",
            "queues": {
                "collector": "collector",
                "worker": "worker"
            },
            "options": {
                "heartbeat": 60
            }
        },
        "couchbase": {
            "uri": "couchbase://localhost:8091",
            "bucket": "microcrawler",
            "username": "Administrator",
            "password": "Administrator",
            "connectionTimeout": 60000000,
            "durabilityTimeout": 60000000,
            "managementTimeout": 60000000,
            "nodeConnectionTimeout": 10000000,
            "operationTimeout": 10000000,
            "viewTimeout": 10000000
        },
        "elasticsearch": {
            "uri": "localhost:9200",
            "index": "microcrawler",
            "log": "debug"
        }
    }
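
    Most of the settings above are self-explanatory, but natFaker is not: it appears to derive randomized client addresses from a base IP, varying the low "bits" bits. A minimal JavaScript sketch of that idea (an illustration only, not microcrawler's actual implementation; fakeNatIp is a hypothetical name):

```javascript
// Illustrative sketch of the natFaker settings above: randomize the low
// `bits` bits of `base`, so base "192.168.1.1" with bits 16 yields a
// random address in 192.168.0.0 - 192.168.255.255.
function fakeNatIp(base = '192.168.1.1', bits = 16) {
  // Convert dotted-quad to an unsigned 32-bit integer and back
  const toInt = ip => ip.split('.').reduce((n, o) => (n << 8) + Number(o), 0) >>> 0;
  const toIp = n => [24, 16, 8, 0].map(s => (n >>> s) & 0xff).join('.');
  const mask = bits === 32 ? 0xffffffff : (1 << bits) - 1;
  const rand = Math.floor(Math.random() * (mask + 1));
  // Keep the high bits of the base, randomize the low `bits` bits
  return toIp((toInt(base) & ~mask) | rand);
}

console.log(fakeNatIp()); // e.g. "192.168.x.x"
```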
    

    Edit config file

    $ vim ~/.microcrawler/config.json
    

    Show config file

    $ microcrawler config show
    {
        "client": "superagent",
        "timeout": 10000,
        "throttler": {
            "enabled": false,
            "active": true,
            "rate": 20,
            "ratePer": 1000,
            "concurrent": 8
        },
        "retry": {
            "count": 2
        },
        "headers": {
            "Accept": "*/*",
            "User-Agent": "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
            "From": "googlebot(at)googlebot.com"
        },
        "proxy": {
            "enabled": false,
            "list": [
                "https://168.63.20.19:8145"
            ]
        },
        "natFaker": {
            "enabled": true,
            "base": "192.168.1.1",
            "bits": 16
        },
        "amqp": {
            "uri": "amqp://example.com",
            "queues": {
                "collector": "collector",
                "worker": "worker"
            },
            "options": {
                "heartbeat": 60
            }
        },
        "couchbase": {
            "uri": "couchbase://example.com:8091",
            "bucket": "microcrawler",
            "username": "Administrator",
            "password": "Administrator",
            "connectionTimeout": 60000000,
            "durabilityTimeout": 60000000,
            "managementTimeout": 60000000,
            "nodeConnectionTimeout": 10000000,
            "operationTimeout": 10000000,
            "viewTimeout": 10000000
        },
        "elasticsearch": {
            "uri": "example.com:9200",
            "index": "microcrawler",
            "log": "debug"
        }
    }
    

    Start Couchbase

    TBD

    Start Elasticsearch

    TBD

    Start Kibana

    TBD

    Query elasticsearch

    TBD

    Example usage

    Each example below follows the same pattern: microcrawler crawl <crawler>.index <url>, where the first argument names one of the available crawlers and the second is the URL to start crawling from.

    Craigslist

    microcrawler crawl craiglist.index http://sfbay.craigslist.org/sfc/sss/
    

    Firmy.cz

    microcrawler crawl firmy.cz.index "https://www.firmy.cz?_escaped_fragment_="
    

    Google

    microcrawler crawl google.index http://google.com/search?q=Buena+Vista
    

    Hacker News

    microcrawler crawl hackernews.index https://news.ycombinator.com/
    

    xkcd

    microcrawler crawl xkcd.index http://xkcd.com
    

    Yelp

    microcrawler crawl yelp.index "http://www.yelp.com/search?find_desc=restaurants&find_loc=Los+Angeles%2C+CA&ns=1&ls=f4de31e623458437"
    

    Youjizz

    microcrawler crawl youjizz.com.index http://youjizz.com
    

    Credits

    Version

    0.1.30

    License

    MIT
