sqlite-bayes

Jeez! Not another!

Node.js is teeming with Naive Bayes classifiers, so before someone asks why we need another, let me explain why this classifier is different and why you must love it! :-)

Motivation

Almost all naive classifiers out there save and consume their data in JSON format, allowing you to persist the data to file.

While this works for most cases, it is problematic when you want to train your classifier on several thousand large documents. It becomes worse when you want to train persistently over a long time.

Imagine you were tracking BuzzFeed headlines and training your classifier to understand clickbait. Would it be convenient to train over a period of months using a JSON file that has to be loaded & held in memory?

What happens if your code exits unexpectedly on the millionth document just before you had persisted to disk?

Is this method of training sustainable and, most of all, scalable?

If you see my point, read on...

So, it turns out there isn't a simple SQL-based Naive Bayes classifier out there. Know one? Show me, please.

Actually, there are a few gists and examples, but most are written for a specific dataset and their logic is often convoluted, involving copying this data to that temporary table and so on.

But Naive Bayes classifiers, in their simplest form, are simple. All they need to know is which document goes into which class. The rest, really, is just arithmetic.
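
To make "just arithmetic" concrete, here is a minimal sketch of the scoring step. It is not the code sqlite-bayes runs: the `counts` object, the function names and the add-one (Laplace) smoothing are all assumptions made for illustration.

```javascript
// Hypothetical counts, shaped like the JSON other classifiers persist.
// sqlite-bayes keeps equivalent numbers in SQLite tables instead.
const counts = {
  docs: { positive: 2, negative: 1 },          // documents seen per category
  words: {                                     // word occurrences per category
    positive: { amazing: 2, awesome: 1, great: 1 },
    negative: { terrible: 1, sucks: 1 }
  }
};

// Score each category as log P(category) + sum of log P(word | category).
// Add-one smoothing (an assumption here) stops unseen words from zeroing a score.
function scoreCategories(tokens, counts) {
  const totalDocs = Object.values(counts.docs).reduce((a, b) => a + b, 0);
  const vocabulary = new Set();
  for (const category of Object.keys(counts.words)) {
    for (const word of Object.keys(counts.words[category])) vocabulary.add(word);
  }
  const scores = {};
  for (const category of Object.keys(counts.docs)) {
    const wordCounts = counts.words[category] || {};
    const totalWords = Object.values(wordCounts).reduce((a, b) => a + b, 0);
    let score = Math.log(counts.docs[category] / totalDocs);
    for (const token of tokens) {
      const seen = wordCounts[token] || 0;
      score += Math.log((seen + 1) / (totalWords + vocabulary.size));
    }
    scores[category] = score;
  }
  return scores;
}

// Pick the category with the highest log score.
function bestCategory(tokens, counts) {
  const scores = scoreCategories(tokens, counts);
  return Object.keys(scores).reduce((best, c) => (scores[c] > scores[best] ? c : best));
}
```

Working in log space keeps long documents from underflowing to zero, which is why the products of probabilities appear as sums here.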

So this classifier implements a database schema that mimics the JSON objects, encoded with classes, documents and their respective counts.

Using simple, straightforward SQL, your database is atomically updated each time you classify a new document and the probabilities change automagically.
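
A rough sketch of what such a schema and update could look like follows. The table names, column names and the `learnStatements` helper are hypothetical, not the ones sqlite-bayes actually uses, and the `ON CONFLICT ... DO UPDATE` upsert syntax assumes SQLite 3.24 or newer. The sketch only builds the statements; running them through a SQLite driver is left out.

```javascript
// Hypothetical schema: one row per category, one row per (category, word) pair.
const SCHEMA = [
  'CREATE TABLE IF NOT EXISTS categories (' +
    'name TEXT PRIMARY KEY, doc_count INTEGER NOT NULL DEFAULT 0)',
  'CREATE TABLE IF NOT EXISTS word_counts (' +
    'category TEXT NOT NULL, word TEXT NOT NULL, ' +
    'occurrences INTEGER NOT NULL DEFAULT 0, PRIMARY KEY (category, word))'
];

// Upserts: insert the row if it is new, otherwise increment the stored count.
const BUMP_CATEGORY =
  'INSERT INTO categories (name, doc_count) VALUES (?, 1) ' +
  'ON CONFLICT(name) DO UPDATE SET doc_count = doc_count + 1';
const BUMP_WORD =
  'INSERT INTO word_counts (category, word, occurrences) VALUES (?, ?, ?) ' +
  'ON CONFLICT(category, word) DO UPDATE SET occurrences = occurrences + excluded.occurrences';

// Build the statements for learning one document, wrapped in a transaction
// so that either every count is updated or none is.
function learnStatements(tokens, category) {
  const frequency = new Map();
  for (const token of tokens) frequency.set(token, (frequency.get(token) || 0) + 1);

  const statements = [
    { sql: 'BEGIN', params: [] },
    { sql: BUMP_CATEGORY, params: [category] }
  ];
  for (const [word, n] of frequency) {
    statements.push({ sql: BUMP_WORD, params: [category, word, n] });
  }
  statements.push({ sql: 'COMMIT', params: [] });
  return statements;
}
```

Because each document becomes one small transaction, a crash part-way through training loses at most the document in flight rather than everything since the last file save.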

You will never need to load heavy files ever again, and because this is SQL(ite), you can carry and plug in your data wherever you go!

Best of all, you can train whenever you come across new documents without affecting any ongoing classifications.

npm install sqlite-bayes

Usage

 
var bayes = require('./lib/naive_bayes');
var path = require('path');

// Some options
var options = {
     "dbPath": path.join(__dirname, 'data'), // path to save the database
     "dbName": 'sentiment-db', // database name
     "stopwords": ['en', 'sw'], // stopwords to use. See https://www.npmjs.com/package/multi-stopwords for more
     "stemmer": 'lancaster', // which stemmer to use. Currently supports the 'lancaster' & 'porter' stemmers via https://www.npmjs.com/package/natural
     "returnProbabilities": 3, // how many probabilities to return. Especially important when you have many classes
     "trace": true // log what's happening?
};

// All options are optional!
var classifier = bayes(options);

// Teach our classifier a few facts
classifier.learn('amazing, awesome movie!! Yeah!! Oh boy.', 'positive');
classifier.learn('Sweet, this is incredibly, amazing, perfect, great!!', 'positive');
classifier.learn('terrible, shitty thing. Damn. Sucks!!', 'negative');

// You must call saveDocs to commit the learned data to the database.
// If this is the first time you are training the classifier, run categorize only after the data has been committed.
classifier.saveDocs(function () {

     // now ask it to categorize a document it has never seen before
     classifier.categorize('this is some incredibly shitty day', function (classification) {
          console.log(JSON.stringify(classification, null, 4));
     });

});
 
 

API

var classifier = bayes([options])

Returns an instance of a Sqlite-Bayes Classifier.

Pass in an optional options object to configure the instance. If you specify a stemmer function in options, it will be used as the instance's tokenizer. The default tokenizer removes punctuation and splits on spaces.
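
As a rough illustration of "removes punctuation and splits on spaces", a tokenizer along these lines would do. This is a sketch, not the library's actual implementation: the function name, the lowercasing step and the ASCII-only character class are assumptions.

```javascript
// Sketch of a default-style tokenizer (assumed behaviour, not sqlite-bayes source).
function tokenize(text) {
  return text
    .toLowerCase()                 // assumption: treat "Yeah" and "yeah" as the same word
    .replace(/[^a-z0-9\s]/g, ' ')  // replace punctuation (anything not a-z, 0-9 or whitespace) with a space
    .split(/\s+/)                  // split on runs of whitespace
    .filter(Boolean);              // drop empty strings left at the edges
}
```

For the first training sentence in Usage, `tokenize('amazing, awesome movie!! Yeah!! Oh boy.')` yields `['amazing', 'awesome', 'movie', 'yeah', 'oh', 'boy']`. Note that this character class also strips non-ASCII letters, so text in other scripts would need a wider pattern.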

NOTE:

  • Once you have created a database using one stemmer (or none), you cannot change this stemmer in the future. SQLite-Bayes stores your initial stemmer and will always use it no matter what. This averts a situation where data stemmed one way is used to categorize text stemmed another way, which would be extremely inaccurate.
  • Text entered to be classified goes through the same stemming that the database was initialized with, to harmonize it with the stored data prior to any classification. This process is automatic and requires no intervention from you!

classifier.learn(text, category)

Teach your classifier what category the text belongs to. The more you teach your classifier, the more reliable it becomes. It will use what it has learned to identify new documents that it hasn't seen before.

classifier.categorize(text, callback)

Passes the category it thinks text belongs to to callback, as shown in Usage. Its judgement is based on what you have taught it with .learn().

Heads Up

This classifier borrows a lot from Bayes by Tolga Tezel.