    twitter-harvest

    A simple continuous harvester for Twitter

    This application captures tweets posted around the world. It currently works only with the Twitter streaming API 1.1.

    • Define or modify cfg/cfg.json and create at least one capture agent in the cfg/agents/ directory (with enable set to true).
    • You can activate mail alerts from an SMTP account such as Gmail (see Private configuration and the mail_alert flag in Main configuration).
    • If todo_out is true (should be false by default), a kind of queue is created (directory 'data/TODO') containing the filenames to be consumed by an external process. This allows the tweets to be written to any database.
      • Note that the number of files per directory is limited (depending on the OS), so the external process must consume the filenames regularly to avoid issues.
    • If fs_out is true (default), the captured tweets are written to the file system with the following convention:

    data_dir/year/month/day/hour-min-sec_tweet-id

    e.g.

    data/2015/9/24/16-30-44_647055571951190000
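
    As a rough illustration of the naming convention above, the following sketch builds such a path from a tweet's timestamp and ID. The helper name tweetPath, the use of UTC, and the unpadded directory parts (taken from the example above) are assumptions, not the package's actual code. The ID is passed as a string because tweet IDs exceed JavaScript's safe integer range.

```javascript
const path = require('path');

// Hypothetical helper (not part of twitter-harvest): builds
// data_dir/year/month/day/hour-min-sec_tweet-id for one tweet.
function tweetPath(dataDir, createdAt, tweetId) {
  const d = new Date(createdAt);
  const two = (n) => String(n).padStart(2, '0');
  // Assumption: UTC time, directory parts unpadded as in the README example.
  const dirParts = [d.getUTCFullYear(), d.getUTCMonth() + 1, d.getUTCDate()].map(String);
  const file = `${two(d.getUTCHours())}-${two(d.getUTCMinutes())}-${two(d.getUTCSeconds())}_${tweetId}`;
  // path.posix keeps forward slashes on every platform.
  return path.posix.join(dataDir, ...dirParts, file);
}

console.log(tweetPath('data', '2015-09-24T16:30:44Z', '647055571951190000'));
// prints: data/2015/9/24/16-30-44_647055571951190000
```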

    Install

    $ npm install --save twitter-harvest

    Usage

    $ node twitter-harvest.js

    Usage with forever

    $ npm install -g forever
    $ forever start twitter-harvest.js

    With forever, the task keeps running indefinitely, even after you leave your session.

    Main configuration

    {
      "agents_dir"    : "cfg/agents/",
      "data_dir"      : "./data/",
      "private_cfg"   : "./cfg/cfg-private.json",
     
      "mail_alert"    : false,
     
      "fs_out"        : true,
      "std_out"       : true,
      "todo_out"      : true  
    }
    • agents_dir: path of the directory containing the agent files
    • data_dir: path where the tweets are written on the file system
    • private_cfg: file where private data is stored (such as mail credentials)
    • mail_alert: if true, enables mail alerts in case of failure
    • fs_out: if true, writes the Twitter data to the file system
    • std_out: if true, writes the Twitter data to the console
    • todo_out: if true, writes the JSON filename to the 'data/TODO' directory (to be consumed by another process that loads it into a DB such as MySQL)
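
    The external process that consumes the 'data/TODO' queue could look roughly like the sketch below. This README does not specify the format of the queue entries, so the sketch assumes that each entry in the TODO directory is a small file whose content is the path of one tweet JSON file. The names drainTodo and handleTweet are hypothetical.

```javascript
const fs = require('fs');
const path = require('path');

// Hypothetical consumer (not part of twitter-harvest).
// Assumption: each entry in todoDir is a file containing the path of a tweet JSON file.
function drainTodo(todoDir, handleTweet) {
  let processed = 0;
  for (const entry of fs.readdirSync(todoDir)) {
    const entryPath = path.join(todoDir, entry);
    const tweetFile = fs.readFileSync(entryPath, 'utf8').trim();
    const tweet = JSON.parse(fs.readFileSync(tweetFile, 'utf8'));
    handleTweet(tweet);        // e.g. insert the tweet into a database
    fs.unlinkSync(entryPath);  // remove the entry only after it was handled
    processed += 1;
  }
  return processed;
}

// Example call (run it regularly so the directory never grows too large):
//   drainTodo('data/TODO', (tweet) => console.log(tweet.id_str));
```

    Removing each entry only after the handler returns means a crash leaves unprocessed entries in place for the next run.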

    Agents configuration

    Put all the agent definition files in the agents directory (one file per agent).

    $ cat cfg/agents/*.json
    {
      "type_doc"            : "twitter",
      "enable"              : true,
      "type_filter"         : "track",
      "type_api"            : "stream",
      "name"                : "keywords-geneva",
      "filter"              : {
        "track"             : "genève,geneva,genebra,genevra,genf"
      },
      "stream"              : "filter",
      "consumer_key"        : "...",
      "consumer_secret"     : "...",
      "access_token_key"    : "...",
      "access_token_secret" : "..."  
    }

    This agent captures all tweets that mention the word Geneva in several languages.

    {
      "type_doc"            : "twitter",
      "enable"              : true,
      "type_filter"         : "locations",
      "type_api"            : "stream",
      "name"                : "location-geneva",
      "filter"              : {
        "locations"  : "5.77,45.85,7.15,46.80"
      },
      "stream"              : "filter",
      "consumer_key"        : "...",
      "consumer_secret"     : "...",
      "access_token_key"    : "...",
      "access_token_secret" : "..."
    }

    This agent captures all tweets posted around the Geneva area (Switzerland).
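
    The "locations" value is a bounding box given as longitude,latitude pairs, with the south-west corner first and the north-east corner second. The sketch below parses the box from the agent above and checks whether a point falls inside it. The helper names parseBoundingBox and contains are hypothetical, the test coordinates are approximate, and only a single box is handled.

```javascript
// Hypothetical helpers for illustration; not part of twitter-harvest.
function parseBoundingBox(locations) {
  const parts = locations.split(',').map(Number);
  if (parts.length !== 4 || parts.some(Number.isNaN)) {
    throw new Error('expected "swLon,swLat,neLon,neLat"');
  }
  const [west, south, east, north] = parts;
  if (west >= east || south >= north) {
    throw new Error('the south-west corner must come first');
  }
  return { west, south, east, north };
}

// True when the point (lon, lat) lies inside the box, edges included.
function contains(box, lon, lat) {
  return lon >= box.west && lon <= box.east && lat >= box.south && lat <= box.north;
}

const box = parseBoundingBox('5.77,45.85,7.15,46.80');
console.log(contains(box, 6.14, 46.20)); // central Geneva (approx.): true
console.log(contains(box, 2.35, 48.86)); // Paris (approx.): false
```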

    • type_doc : 'twitter'
    • enable : if true, this agent is launched
    • type_filter : track | locations | follow
    • stream : filter | firehose (if you have access to it)
    • consumer_key, consumer_secret, access_token_key, access_token_secret : personal keys provided by Twitter for using its APIs

    More Twitter API documentation: https://dev.twitter.com/streaming/overview/request-parameters
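
    To show how the agent fields relate to a streaming request, here is a minimal sketch that turns an agent definition into the arguments a Twit client would take. The function toTwitArgs is hypothetical and the package's real internals may differ. In particular, it assumes that type_filter names the key to read inside "filter" and that "stream" selects the statuses/... endpoint.

```javascript
// Hypothetical mapping (not part of twitter-harvest) from an agent file
// to the arguments of a Twit client.
function toTwitArgs(agent) {
  return {
    credentials: {
      consumer_key: agent.consumer_key,
      consumer_secret: agent.consumer_secret,
      // Twit names these two options access_token and access_token_secret.
      access_token: agent.access_token_key,
      access_token_secret: agent.access_token_secret,
    },
    // Assumption: "stream" selects the endpoint, e.g. statuses/filter.
    endpoint: `statuses/${agent.stream}`,
    // Assumption: type_filter names the key to read inside "filter".
    params: { [agent.type_filter]: agent.filter[agent.type_filter] },
  };
}

const args = toTwitArgs({
  type_filter: 'track',
  filter: { track: 'genève,geneva,genebra,genevra,genf' },
  stream: 'filter',
  consumer_key: 'key',
  consumer_secret: 'secret',
  access_token_key: 'token',
  access_token_secret: 'token-secret',
});
console.log(args.endpoint, args.params);

// With the twit package installed, the stream could then be opened like this:
//   const Twit = require('twit');
//   const stream = new Twit(args.credentials).stream(args.endpoint, args.params);
//   stream.on('tweet', (tweet) => console.log(tweet.id_str, tweet.text));
```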

    Private configuration

    {
      "mail_service"    : "gmail",
      "mail_auth_user"  : "username",
      "mail_auth_path"  : "password",
      "mail_from"       : "alert_twitter_harvest",
      "mail_to"         : "name@gmail.com"
    }
    • mail_service : name of the mail service
    • mail_auth_user : username for the mail service
    • mail_auth_path : password for the mail service
    • mail_from : sender of the alert mails
    • mail_to : recipient of the alert mails

    A mail is also sent when the system starts. If everything is configured correctly, you should receive it in your mailbox.

    Note: the supported mail systems are those provided by the nodemailer Node module (see the list of supported services: https://github.com/andris9/nodemailer-wellknown#supported-services), but only Gmail was tested. For Gmail, you may have to lower the security level of your mail account (so don't use a personal account) and specifically authorize the application using this URL: https://g.co/allowaccess
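
    The private configuration keys map naturally onto nodemailer's transport and message options. The sketch below shows one possible mapping. The function toMailSettings, the subject, and the body text are hypothetical, and the way twitter-harvest actually builds its alert mails is not documented here.

```javascript
// Hypothetical mapping (not part of twitter-harvest) from cfg-private.json
// to the objects nodemailer expects.
function toMailSettings(priv, subject, text) {
  return {
    transport: {
      service: priv.mail_service,
      // mail_auth_path holds the password in this configuration format.
      auth: { user: priv.mail_auth_user, pass: priv.mail_auth_path },
    },
    message: { from: priv.mail_from, to: priv.mail_to, subject, text },
  };
}

const settings = toMailSettings(
  {
    mail_service: 'gmail',
    mail_auth_user: 'username',
    mail_auth_path: 'password',
    mail_from: 'alert_twitter_harvest',
    mail_to: 'name@gmail.com',
  },
  'twitter-harvest started',
  'All enabled agents are running.'
);
console.log(settings.message.to);

// With nodemailer installed, the alert could then be sent like this:
//   const nodemailer = require('nodemailer');
//   nodemailer.createTransport(settings.transport)
//     .sendMail(settings.message, (err) => { if (err) console.error(err); });
```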

    Test

    $ gulp

    Notes

    Note that three error messages are currently displayed when twitter-harvest is launched. They can be ignored. Here are these error messages:

    { [Error: Cannot find module './build/Release/DTraceProviderBindings'] code: 'MODULE_NOT_FOUND' }
    { [Error: Cannot find module './build/default/DTraceProviderBindings'] code: 'MODULE_NOT_FOUND' }
    { [Error: Cannot find module './build/Debug/DTraceProviderBindings'] code: 'MODULE_NOT_FOUND' }

    To do

    • add more tests
    • add an option to include extra info (from agents) in the output
    • add other api interface (not only the streaming API)

    License

    MIT © Arnaud Gaudinat

    Change log

    • 0.3.4:
      • change the node twitter lib to Twit (for better error handling)
    • 0.3.3:
      • add the TODO option and directory to allow writing in DB
      • add 2 digits on filenames and JSON extension
    • 0.3.2:
      • add JSONschema validation
