# String randomness score generator

A lightweight, zero-dependency package that generates a randomness score for a string. It can be used to tell whether a string is gibberish or word-like. Some applications:

• Identify if a user is typing something or just banging the keyboard
• Determine if a string is an API key, access token, etc.
• Check if a string is something randomly generated by a computer

## Usage

The tool returns a randomness score for a string. You can tune the threshold for your use case, but generally a score above 4 indicates that the input string is random.

### NPM package

• Install the npm package
• Import and use it in your code like this:

```
const Model = require('./Model');

// Remember to load the model before using it

const score = Model.score("helloWorld");
```

## How does it work?

### Training

• At its core, the model is a bigram model that estimates the probability of the next character given the current character. (A higher-order n-gram model would give better results, but it's a WIP.)
• We parse a comprehensive list of English words to create a 2D table that stores how often each character follows the current character.
• While generating this table, we also add a special `<.>` character at the start and end of each word to count how many words start and end with each character. The table is then row-normalized so that each row sums to 1. This gives us the probability of a character following the current character. These probabilities are used in the score calculation.
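
The training steps above can be sketched in JavaScript as follows. This is only an illustration under stated assumptions: the actual model is generated in a Python notebook, and the names `buildBigramTable` and `BOUNDARY`, the nested `Map` layout, and the three-word list are all hypothetical.

```javascript
// Hypothetical sketch of the training steps; not the package's actual code.
const BOUNDARY = '<.>'; // special start/end marker described above

function buildBigramTable(words) {
  // counts: Map<currentSymbol, Map<nextSymbol, occurrences>>
  const counts = new Map();
  for (const word of words) {
    // Wrap each lowercased word in boundary markers so that word starts
    // and word ends are counted as transitions too.
    const symbols = [BOUNDARY, ...word.toLowerCase(), BOUNDARY];
    for (let i = 0; i < symbols.length - 1; i++) {
      const row = counts.get(symbols[i]) ?? new Map();
      row.set(symbols[i + 1], (row.get(symbols[i + 1]) ?? 0) + 1);
      counts.set(symbols[i], row);
    }
  }

  // Row-normalize: divide each count by its row total so every row sums to 1.
  const probabilities = new Map();
  for (const [current, row] of counts) {
    let total = 0;
    for (const count of row.values()) total += count;
    const normalized = new Map();
    for (const [next, count] of row) normalized.set(next, count / total);
    probabilities.set(current, normalized);
  }
  return probabilities;
}

// Toy word list, for illustration only.
const table = buildBigramTable(['cat', 'car', 'cot']);
console.log(table.get('c').get('a'));      // share of 'c' followed by 'a'
console.log(table.get(BOUNDARY).get('c')); // share of words starting with 'c'
```

With this toy list, two of the three occurrences of `c` are followed by `a`, and every word starts with `c`, so the second probability is 1.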

### Score Calculation

• We first parse the word to convert it to lowercase and remove any extra characters.
• Then, since we have a bigram model, we break the word down into pairs of adjacent characters (including the special start and end `<.>` character).
• Next, we take the log of the probability of each pair. (As these probabilities are very small, their logs are a more uniform measure.)
• We add these log values for all the pairs in the word.
• As this sum is a negative number, we invert it to get a positive value.
• We divide this score by the number of characters in the word to get the final score.
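
The scoring steps above can be sketched as follows. This is a minimal illustration, not the package's implementation. Several details are assumptions because the text does not specify them: "extra characters" is taken to mean anything outside `a-z`, the natural log is used, unseen pairs fall back to a small `FLOOR` probability, and an empty input returns 0. The `toyTable` values are made up.

```javascript
// Hypothetical sketch of the score calculation; not the package's actual code.
const BOUNDARY = '<.>';
const FLOOR = 1e-6; // assumption: fallback probability for pairs missing from the table

function score(input, table) {
  // Lowercase and drop anything outside a-z (assumed meaning of "extra characters").
  const chars = [...input.toLowerCase().replace(/[^a-z]/g, '')];
  if (chars.length === 0) return 0; // assumption: nothing left to score

  // Add the boundary markers, then walk over adjacent pairs.
  const symbols = [BOUNDARY, ...chars, BOUNDARY];
  let logSum = 0;
  for (let i = 0; i < symbols.length - 1; i++) {
    const p = table[symbols[i]]?.[symbols[i + 1]] ?? FLOOR;
    logSum += Math.log(p); // natural log; always <= 0 for a probability
  }

  // Negate the (negative) sum and divide by the number of characters.
  return -logSum / chars.length;
}

// Made-up probabilities, for illustration only.
const toyTable = {
  '<.>': { a: 0.5, b: 0.5 },
  a: { b: 0.5, '<.>': 0.5 },
  b: { a: 0.5, '<.>': 0.5 },
};

console.log(score('ab', toyTable)); // every pair is in the table: lower score
console.log(score('aa', toyTable)); // 'a' followed by 'a' is unseen: higher score
```

Scores from this toy table are not comparable to the threshold of 4 mentioned under Usage, which applies to the real trained model.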

## Contribute

• Create a fork and clone it.
• To contribute to the model generation part, navigate to the modelGenerator/ folder. This contains a Python notebook used for generating the model. Feel free to suggest improvements to the model.
• To contribute to the npm package, go into the modelGenerator/ directory, which contains the source code for the npm package as well as the latest model being used for calculation.

## Gotchas & Improvements

• The model is trained on English words and may not work for other languages.
• To reduce training complexity, the model is case-insensitive.
• The current model is not very accurate for very short strings.
• The dataset the model is built on does not have first-class support for numbers and some special characters, so scores for strings involving these can be inaccurate.
• The dataset does not include keyboard-common strings like "qwerty", so the results may not be correct for strings of this category.
• The current model is a bigram model. We could use deep learning to replace it with a higher-order n-gram model for better results.

## Maintainer

• Pranav Joglekar

