node package manager
Share your code. npm Orgs help your team discover, share, and reuse code. Create a free org »

twitter-harvest

twitter-harvest NPM version Build Status Dependency Status Coverage percentage

A simple continuous harvester for twitter

This application is able to capture tweets which happen around the world. Currently it works only with the Twitter stream API 1.1.

  • You have to define or modify the cfg/cfg.json and create at least one capture agent in cfg/agents/ directory (enable to true).
  • You can activate mail alert from a SMTP account like gmail (see Private configuration and the mail_alert flag in main configuration)
  • If fs_out is true (default), the captured tweets are written to the file system with the following convention:
  • If todo_out is true (should be false by default), a kind of queue is created (directory 'data/TODO') where filenames to consume by an external process. This allow to write the tweets to any db
    • Note, that the number of files by directory is limited (depend of the OS), the filenames need to be consumed by the external process regularly to avoid issues

data_dir/year/month/day/hour-min-sec_tweet-id

e.g.

data/2015/9/24/16-30-44_647055571951190000

Install

$ npm install --save twitter-harvest

Usage

node twitter-harvest.js

Usage with forever

$ npm install -g forever
$ forever start twitter-harvest.js

With forever it is possible to run the task 'forever'. And leave your session.

Main configuration

{
  "agents_dir"    : "cfg/agents/",
  "data_dir"      : "./data/",
  "private_cfg"   : "./cfg/cfg-private.json",
 
  "mail_alert"    : false,
 
  "fs_out"        : true,
  "std_out"       : true,
  "todo_out"      : true  
}
  • agents_dir: path where to put the agent file
  • data_dir: path where to write the tweets on the file system
  • private_cfg: file where private data is stored (such as mail credential)
  • mail_alert: if true enable mail alerting in case of failure
  • fs_out: if true write the twitter data on the file system
  • std_out: if true write the twitter data on the console
  • todo_out: if true write the json filename in the 'data/TODO' dir (to be consumed by an other process to BD (mysql, ...)

Agents configuration

put all the agent definition files to the agent directory (one file per agent).

$ cat cfg/agents/*.json
{
  "type_doc"            : "twitter",
  "enable"              : true,
  "type_filter"         : "track",
  "type_api"            : "stream",
  "name"                : "keywords-geneva",
  "filter"              : {
    "track"             : "genève,geneva,genebra,genevra,genf"
  },
  "stream"              : "filter",
  "consumer_key"        : "...",
  "consumer_secret"     : "...",
  "access_token_key"    : "...",
  "access_token_secret" : "..."  
}

to capture all the tweets where there is a mention of geneva word for several languages.

{
  "type_doc"            : "twitter",
  "enable"              : true,
  "type_filter"         : "locations",
  "type_api"            : "stream",
  "name"                : "location-geneva",
  "filter"              : {
    "locations"  : "5.77,45.85,7.15,46.80"
  },
  "stream"              : "filter",
  "consumer_key"        : "...",
  "consumer_secret"     : "...",
  "access_token_key"    : "...",
  "access_token_secret" : "..."
}

to capture all the tweets which are posted around Geneva area (Switzerland).

  • type_doc : 'twitter'
  • enable : if true this agent is launched
  • type_filter : locations | filter | follow
  • stream : filter | firehose (if you have the chance)
  • consumer_key, consumer_secret, access_token_key, access_token_secret : personal keys given by twitter for using their APIs

more API twitter doc https://dev.twitter.com/streaming/overview/request-parameters

Private configuration

{
  "mail_service"    : "gmail",
  "mail_auth_user"  : "username",
  "mail_auth_path"  : "password",
  "mail_from"       : "alert_twitter_harvest",
  "mail_to"         : "name@gmail.com"
}
  • mail_service : name of the mail service
  • mail_auth_user : username credential of the mail service
  • mail_auth_path : password credential of the mail service
  • mail_from : who will send the mail
  • mail_to : who want to be alerted

One mail is also sent when the system is started, you should received this mail on your mail box if all well configured.

note : supported mail system is given by nodemailer node module (here is the supported service https://github.com/andris9/nodemailer-wellknown#supported-services), but only gmail was tested for gmail, it is possible you have to decrease the security level of your mail account (so don't use a personal account) and to authorize specifically the application by using this url: https://g.co/allowaccess

Test

$ gulp

Notes

Note that currently, we have 3 errors messages when twitter-harvest is launched. This is not important. Here are theses Error messages

{ [Error: Cannot find module './build/Release/DTraceProviderBindings'] code: 'MODULE_NOT_FOUND' }
{ [Error: Cannot find module './build/default/DTraceProviderBindings'] code: 'MODULE_NOT_FOUND' }
{ [Error: Cannot find module './build/Debug/DTraceProviderBindings'] code: 'MODULE_NOT_FOUND' }

To do

  • add more tests
  • add extra option to add extra info in the output(from agents)
  • add other api interface (not only the streaming API)

License

MIT © Arnaud Gaudinat

Change log

  • 0.3.4:
    • chat the node twitter lib with Twit (for better handling of error)
  • 0.3.3:
    • add the TODO option and directory to allow writing in DB
    • add 2 digits on filenames and JSON extension
  • 0.3.2:
    • add JSONschema validation