A simple continuous harvester for Twitter
This application captures tweets posted around the world. It currently works only with the Twitter streaming API 1.1.
- You have to define or modify cfg/cfg.json and create at least one capture agent in the cfg/agents/ directory (with enable set to true).
- You can activate mail alerts from an SMTP account such as Gmail (see Private configuration and the mail_alert flag in Main configuration).
- If fs_out is true (default), the captured tweets are written to the file system with the following convention:
  data_dir/year/month/day/hour-min-sec_tweet-id
  e.g. data/2015/9/24/16-30-44_647055571951190000
- If todo_out is true (should be false by default), a kind of queue is created (directory 'data/TODO') holding the filenames to be consumed by an external process. This allows the tweets to be written to any DB.
- Note that the number of files per directory is limited (depending on the OS), so the external process needs to consume the filenames regularly to avoid issues.
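The file naming convention above can be sketched as a small function. This is only an illustration: tweetPath and its arguments are hypothetical names, the use of UTC is an assumption, and the change log for 0.3.3 mentions two-digit fields and a JSON extension, so the filenames written by recent versions may differ.

```javascript
// Sketch only: tweetPath is a hypothetical helper, not a function
// exported by twitter-harvest.
function tweetPath(dataDir, createdAt, tweetId) {
  const d = new Date(createdAt);
  // Assumption: UTC is used here; the harvester may use local time instead.
  const day = [d.getUTCFullYear(), d.getUTCMonth() + 1, d.getUTCDate()].join('/');
  const time = [d.getUTCHours(), d.getUTCMinutes(), d.getUTCSeconds()].join('-');
  // The id is kept as a string: ids of this size exceed Number.MAX_SAFE_INTEGER.
  return `${dataDir}/${day}/${time}_${tweetId}`;
}

console.log(tweetPath('data', Date.UTC(2015, 8, 24, 16, 30, 44), '647055571951190000'));
// prints "data/2015/9/24/16-30-44_647055571951190000"
```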
Install
$ npm install --save twitter-harvest
Usage
$ node twitter-harvest.js
Usage with forever
$ npm install -g forever
$ forever start twitter-harvest.js
With forever, the task keeps running after you leave your session.
Main configuration
- agents_dir: path where the agent files are stored
- data_dir: path where the tweets are written on the file system
- private_cfg: file where private data is stored (such as mail credentials)
- mail_alert: if true, enables mail alerts in case of failure
- fs_out: if true, writes the Twitter data to the file system
- std_out: if true, writes the Twitter data to the console
- todo_out: if true, writes the JSON filename to the 'data/TODO' directory (to be consumed by another process that loads it into a DB such as MySQL)
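As an illustration of how these keys fit together, here is a minimal sketch that merges a parsed cfg.json object over a set of defaults. resolveConfig and DEFAULTS are hypothetical names. Only the defaults for fs_out (true) and todo_out (false) are stated in this README; every other default value, including the private_cfg file name, is an assumption.

```javascript
// Hypothetical defaults: only fs_out and todo_out are documented,
// the other values are assumptions made for this sketch.
const DEFAULTS = {
  agents_dir: 'cfg/agents/',
  data_dir: 'data/',
  private_cfg: 'cfg/private.json', // assumed file name
  mail_alert: false,
  fs_out: true,
  std_out: false,
  todo_out: false
};

// Merge a parsed cfg.json object over the defaults and reject unknown keys.
function resolveConfig(userCfg) {
  for (const key of Object.keys(userCfg)) {
    if (!Object.prototype.hasOwnProperty.call(DEFAULTS, key)) {
      throw new Error(`unknown configuration key: ${key}`);
    }
  }
  return Object.assign({}, DEFAULTS, userCfg);
}

const cfg = resolveConfig({ mail_alert: true, todo_out: true });
console.log(cfg.fs_out, cfg.todo_out); // prints "true true"
```

Rejecting unknown keys is a design choice of this sketch: a misspelled flag then fails loudly instead of being silently ignored.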
Agents configuration
Put all the agent definition files in the agents directory (one file per agent).
$ cat cfg/agents/*.json
An agent can, for example, capture all the tweets that mention the word Geneva in several languages, or all the tweets posted around the Geneva area (Switzerland).
- type_doc : 'twitter'
- enable : if true, this agent is launched
- type_filter : locations | filter | follow
- stream : filter | firehose (if you have the chance)
- consumer_key, consumer_secret, access_token_key, access_token_secret : personal keys given by Twitter for using their APIs
More Twitter API documentation: https://dev.twitter.com/streaming/overview/request-parameters
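The fields listed above can be checked with a few lines of code before an agent is started. The sketch below is a hand-written check using only the values named in this section. checkAgent is a hypothetical helper and is not the JSON schema validation shipped with the package. The key values are placeholders.

```javascript
// Hypothetical check based on the field list above.
const TYPE_FILTERS = ['locations', 'filter', 'follow'];
const STREAMS = ['filter', 'firehose'];
const KEYS = ['consumer_key', 'consumer_secret', 'access_token_key', 'access_token_secret'];

// Return the list of problems found in one agent definition (empty if none).
function checkAgent(agent) {
  const problems = [];
  if (agent.type_doc !== 'twitter') problems.push('type_doc must be "twitter"');
  if (typeof agent.enable !== 'boolean') problems.push('enable must be true or false');
  if (!TYPE_FILTERS.includes(agent.type_filter)) problems.push('unknown type_filter');
  if (!STREAMS.includes(agent.stream)) problems.push('unknown stream');
  for (const key of KEYS) {
    if (typeof agent[key] !== 'string' || agent[key] === '') {
      problems.push(`missing ${key}`);
    }
  }
  return problems;
}

const agent = {
  type_doc: 'twitter',
  enable: true,
  type_filter: 'locations',
  stream: 'filter',
  consumer_key: 'xxx', // placeholder, not a real key
  consumer_secret: 'xxx',
  access_token_key: 'xxx',
  access_token_secret: 'xxx'
};
console.log(checkAgent(agent)); // an empty list means no problem was found
```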
Private configuration
- mail_service : name of the mail service
- mail_auth_user : username credential of the mail service
- mail_auth_path : password credential of the mail service
- mail_from : who will send the mail
- mail_to : who wants to be alerted
A mail is also sent when the system starts; you should receive it in your mailbox if everything is configured correctly.
Note: the supported mail services are those of the nodemailer Node module (see the list of supported services: https://github.com/andris9/nodemailer-wellknown#supported-services), but only Gmail was tested. For Gmail, you may have to lower the security level of your mail account (so don't use a personal account) and authorize the application specifically through this URL: https://g.co/allowaccess
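To show how the private configuration keys might map onto nodemailer, here is a minimal sketch that only builds plain option objects and does not load nodemailer itself. transportOptions and startupMail are hypothetical helpers, the subject and text are assumed wording, and the values are placeholders.

```javascript
// Hypothetical helpers: they build plain objects from the private configuration.
function transportOptions(privateCfg) {
  return {
    service: privateCfg.mail_service,
    auth: {
      user: privateCfg.mail_auth_user,
      pass: privateCfg.mail_auth_path // key name as documented above
    }
  };
}

function startupMail(privateCfg) {
  return {
    from: privateCfg.mail_from,
    to: privateCfg.mail_to,
    subject: 'twitter-harvest started', // assumed wording
    text: 'The harvester has been started.'
  };
}

// Placeholder values, not real credentials.
const privateCfg = {
  mail_service: 'gmail',
  mail_auth_user: 'harvest.alerts@example.com',
  mail_auth_path: 'app-password',
  mail_from: 'harvest.alerts@example.com',
  mail_to: 'admin@example.com'
};
console.log(transportOptions(privateCfg).service); // prints "gmail"
```

With nodemailer installed, the first object could be passed to nodemailer.createTransport() and the second to the sendMail() method of the resulting transporter.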
Test
$ gulp
Notes
Note that currently, 3 error messages appear when twitter-harvest is launched. They can be ignored.
To do
- add more tests
- add an option to include extra info in the output (from agents)
- add other API interfaces (not only the streaming API)
License
MIT © Arnaud Gaudinat
Change log
- 0.3.4:
  - change the node twitter lib to Twit (for better handling of errors)
- 0.3.3:
  - add the TODO option and directory to allow writing to a DB
  - add 2 digits on filenames and JSON extension
- 0.3.2:
  - add JSONschema validation