A simple continuous harvester for Twitter
This application captures tweets posted anywhere in the world. It currently works only with version 1.1 of the Twitter streaming API.
- You have to define or modify the cfg/cfg.json and create at least one capture
- You can activate mail alerts from an SMTP account such as Gmail (see Private configuration and the mail_alert flag in the main configuration)
- If fs_out is true (default), the captured tweets are written to the file system with the following convention:
- If todo_out is true (should be false by default), a kind of queue is created (directory 'data/TODO') holding the filenames to be consumed by an external process. This allows the tweets to be written to any database
- Note that the number of files per directory is limited (depending on the OS), so the external process needs to consume the filenames regularly to avoid issues
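The external consumer mentioned above could look like the following minimal sketch. It is not part of twitter-harvest. It assumes that each entry in 'data/TODO' is a regular file whose name identifies a captured tweet file, and that deleting the entry is how the queue is drained; `drainTodo` and `handle` are hypothetical names.

```javascript
'use strict';
const fs = require('fs');
const path = require('path');

// Drain a queue directory once: call `handle` for every entry, then delete it.
// ASSUMPTION: each entry is a regular file whose name identifies a captured
// tweet file; the real layout of 'data/TODO' may differ.
function drainTodo(todoDir, handle) {
  const names = fs.readdirSync(todoDir).sort(); // sorted for a stable order
  let consumed = 0;
  for (const name of names) {
    const entry = path.join(todoDir, name);
    if (!fs.statSync(entry).isFile()) continue; // skip sub-directories
    try {
      handle(name);         // e.g. write the referenced tweets to a database
      fs.unlinkSync(entry); // remove the entry only after the handler succeeded
      consumed += 1;
    } catch (err) {
      // Leave the entry in place so that it is retried on the next run.
      console.error('could not consume ' + name + ': ' + err.message);
    }
  }
  return consumed;
}

// Possible usage: drain the queue once per minute.
//   setInterval(() => drainTodo('data/TODO', (name) => console.log(name)), 60 * 1000);
```

Running such a loop regularly keeps the number of files in the queue directory low, which is the concern raised in the note above.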
$ npm install --save twitter-harvest
$ npm install -g forever
$ forever start twitter-harvest.js
With forever, the task keeps running in the background, so you can leave your session.
- agents_dir: path where to put the agent file
- data_dir: path where to write the tweets on the file system
- private_cfg: file where private data is stored (such as mail credential)
- mail_alert: if true, enable mail alerts in case of failure
- fs_out: if true, write the Twitter data to the file system
- std_out: if true, write the Twitter data to the console
- todo_out: if true, write the JSON filename to the 'data/TODO' dir (to be consumed by another process that writes to a DB such as MySQL)
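As an illustration of the options listed above, here is a minimal sketch that checks a parsed main configuration object. It is not the project's own validation. The example values are assumptions: 'cfg/private.json' is a hypothetical file name, and the flags are assumed to be JSON booleans.

```javascript
'use strict';

// Keys documented above; the expected types are inferred from their descriptions.
const PATH_KEYS = ['agents_dir', 'data_dir', 'private_cfg'];
const FLAG_KEYS = ['mail_alert', 'fs_out', 'std_out', 'todo_out'];

// Return the list of problems found in a parsed cfg object (empty if none).
function checkMainConfig(cfg) {
  const problems = [];
  for (const key of PATH_KEYS) {
    if (typeof cfg[key] !== 'string' || cfg[key] === '') {
      problems.push(key + ' must be a non-empty string');
    }
  }
  for (const key of FLAG_KEYS) {
    if (typeof cfg[key] !== 'boolean') {
      problems.push(key + ' must be true or false');
    }
  }
  return problems;
}

// Illustrative values only (ASSUMPTION): the real paths may differ.
const exampleCfg = {
  agents_dir: 'cfg/agents',
  data_dir: 'data',
  private_cfg: 'cfg/private.json', // hypothetical file name
  mail_alert: false,
  fs_out: true,   // documented as the default
  std_out: false,
  todo_out: false // documented as "should be false by default"
};

console.log(checkMainConfig(exampleCfg)); // prints []
```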
Put all the agent definition files in the agent directory (one file per agent).
$ cat cfg/agents/*.json
To capture all the tweets that mention the word 'geneva' in several languages.
To capture all the tweets posted around the Geneva area (Switzerland).
- type_doc : 'twitter'
- enable : if true, this agent is launched
- type_filter : locations | filter | follow
- stream : filter | firehose (if you have the chance)
- consumer_key, consumer_secret, access_token_key, access_token_secret : personal keys given by twitter for using their APIs
More Twitter API documentation: https://dev.twitter.com/streaming/overview/request-parameters
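To make the role of the agent directory and of the enable field concrete, here is a minimal sketch of how agent files could be loaded. It is not the project's actual loader. It assumes that each file holds a single JSON object and that enable is a JSON boolean; `loadEnabledAgents` is a hypothetical name.

```javascript
'use strict';
const fs = require('fs');
const path = require('path');

// Read every *.json file in agentsDir and keep only the agents to launch.
// ASSUMPTION: one JSON object per file, with `enable` stored as a boolean.
function loadEnabledAgents(agentsDir) {
  return fs.readdirSync(agentsDir)
    .filter((name) => name.endsWith('.json'))
    .sort()
    .map((name) => {
      const text = fs.readFileSync(path.join(agentsDir, name), 'utf8');
      return { file: name, agent: JSON.parse(text) }; // throws on invalid JSON
    })
    .filter(({ agent }) => agent.type_doc === 'twitter' && agent.enable === true);
}

// Possible usage:
//   loadEnabledAgents('cfg/agents').forEach(({ file }) => console.log('launching ' + file));
```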
- mail_service : name of the mail service
- mail_auth_user : username credential of the mail service
- mail_auth_path : password credential of the mail service
- mail_from : who will send the mail
- mail_to : who wants to be alerted
A mail is also sent when the system starts; you should receive it in your mailbox if everything is configured correctly.
Note: the supported mail systems are those provided by the nodemailer Node module (see the supported services at https://github.com/andris9/nodemailer-wellknown#supported-services), but only Gmail was tested. For Gmail, you may have to lower the security level of your mail account (so don't use a personal account) and explicitly authorize the application at this URL: https://g.co/allowaccess
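The private configuration fields above map naturally onto the options nodemailer takes. The following minimal sketch shows one possible mapping; it is an assumption about how the fields are used, not the project's actual code, and `buildMailOptions`, the subject, and the message text are hypothetical.

```javascript
'use strict';

// Map the private configuration fields to the option objects used by nodemailer.
// ASSUMPTION: this mirrors the documented fields; the project may wire them differently.
function buildMailOptions(privateCfg, subject, text) {
  return {
    transport: {
      service: privateCfg.mail_service, // e.g. 'gmail'
      auth: {
        user: privateCfg.mail_auth_user,
        pass: privateCfg.mail_auth_path // the password field, as named above
      }
    },
    message: {
      from: privateCfg.mail_from,
      to: privateCfg.mail_to,
      subject: subject,
      text: text
    }
  };
}

// With nodemailer installed (npm install nodemailer), the objects could be used as:
//   const nodemailer = require('nodemailer');
//   const opts = buildMailOptions(privateCfg, 'twitter-harvest started', 'The harvester is up.');
//   nodemailer.createTransport(opts.transport).sendMail(opts.message, (err) => {
//     if (err) console.error(err.message);
//   });
```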
Note that 3 error messages are currently displayed when twitter-harvest is launched. They are not important. Here are these error messages:
- add more tests
- add an extra option to add extra info in the output (from agents)
- add other api interface (not only the streaming API)
MIT © Arnaud Gaudinat
- change the node twitter lib to Twit (for better error handling)
- add the TODO option and directory to allow writing in DB
- add 2 digits on filenames and JSON extension
- add JSONschema validation