zipfian-integer

1.0.1 • Public • Published

zipfian-integer

Get an integer between a min and max with skew towards either.
A JS port of Apache Commons Math's ZipfRejectionInversionSampler.

npm status node Travis build status JavaScript Style Guide

Table of Contents

Click to expand

Usage

const zipfian = require('zipfian-integer')
const sample = zipfian(1, 100, 0.2)
 
console.log(sample())
console.log(sample())

This logs two random integers between 1 and 100 with a skew of 0.2, thus more frequently returning integers < 50. You can optionally inject a (seeded) random number generator. The following example always returns the same integers in sequence unless you change the seed:

const random = require('pseudo-math-random')('a seed')
const sample = zipfian(1, 100, 0.2, random)

About

This module is an optionally deterministic random number generator. With a skew parameter of 0 it produces integers with a uniform distribution over the range min to max. As skew increases, it produces integers with a Zipfian distribution over that range: integers near the min become rapidly more likely than integers near the max.

💡 Zipf's law states that given some corpus of natural language utterances, the frequency of any word is inversely proportional to its rank in the frequency table. Thus the most frequent word will occur approximately twice as often as the second most frequent word, three times as often as the third most frequent word, etc.
Zipf's law

For example, words in a typical English text have a Zipfian skew of 1.07. If we have an array of all 171_000 English words ordered by their frequency and want to randomly pick a word, favoring the naturally most frequent words like "the" and "of":

const corpus = ['the', 'of', ..., 'ragtop', 'eucatastrophe']
const randomIndex = zipfian(0, corpus.length - 1, 1.07)
 
console.log(corpus[randomIndex()])
console.log(corpus[randomIndex()])

💡 The same relationship occurs in many other rankings unrelated to language.
Zipf's law

This means we're not limited to words in the English language. We can make use of Zipf's law in benchmarks of a key-value store, for example! For convenience zipfian-integer also supports a negative skew, more frequently returning integers leaning towards the max. Let's say the integers represent the keyspace of a key-value store with numeric keys in insertion order, then a positive skew favors the oldest keys while a negative skew favors the latest keys.

By using the integer returned by zipfian-integer not as a rank (that then requires a lookup) but as the key itself, we have a O(1) way to target keys in a keyspace of any size. If we then also inject a seeded random number generator into zipfian-integer, we can make every benchmark target the exact same keys. For instance to benchmark reads on the oldest keys or uniformly distributed writes on the entire keyspace.

The algorithm is fast, accurate and has a constant memory footprint. Other solutions like prob.js build a lookup table which costs time and memory. Performance of zipfian-integer depends mostly on skew and your choice of random number generator (see benchmarks below).

Visual Example

skew=1 vs skew=-1
skew=2 vs skew=-2
skew=1 vs skew=-1, same seed
skew=0

API

sample = zipfian(min, max, skew[, rng])

Create a new random number generator with a Zipfian distribution. The skew must be a floating-point number. The rng if provided must be a function that returns a random floating-point number between 0 (inclusive) and 1 (exclusive). It defaults to Math.random.

num = sample()

Get a random integer between min (inclusive) and max (inclusive).

Install

With npm do:

npm install zipfian-integer

Development

Verify

A small test is included to verify zipfian-integer results against results of the Apache Commons Math original. First generate test data (a few million combinations of parameters):

  • Install JDK and Maven
  • cd test/java
  • mvn compile
  • mvn -q exec:java > ../../data.ndjson

Then verify it:

  • cd ../..
  • npm i
  • node test/verify.js test/data.ndjson

Benchmark

$ node benchmark.js
node v10.14.1

n=1e2  skew=+0.0 pseudo-math-random x 8,446,287 ops/sec ±0.67%
n=1e2  skew=+0.0 Math.random x 9,556,211 ops/sec ±0.34%
n=1e6  skew=+0.0 pseudo-math-random x 8,141,930 ops/sec ±0.43%
n=1e6  skew=+0.0 Math.random x 9,509,349 ops/sec ±0.42%
n=1e12 skew=+0.0 pseudo-math-random x 7,418,569 ops/sec ±0.44%
n=1e12 skew=+0.0 Math.random x 8,905,792 ops/sec ±0.28%
n=1e2  skew=+1.0 pseudo-math-random x 12,010,890 ops/sec ±0.41%
n=1e2  skew=+1.0 Math.random x 19,650,279 ops/sec ±0.55%
n=1e6  skew=+1.0 pseudo-math-random x 11,954,408 ops/sec ±0.53%
n=1e6  skew=+1.0 Math.random x 19,752,283 ops/sec ±0.64%
n=1e12 skew=+1.0 pseudo-math-random x 11,579,715 ops/sec ±0.51%
n=1e12 skew=+1.0 Math.random x 17,908,808 ops/sec ±0.51%
n=1e2  skew=-0.5 pseudo-math-random x 7,907,162 ops/sec ±0.40%
n=1e2  skew=-0.5 Math.random x 9,388,148 ops/sec ±0.52%
n=1e6  skew=-0.5 pseudo-math-random x 7,879,909 ops/sec ±0.35%
n=1e6  skew=-0.5 Math.random x 9,196,799 ops/sec ±0.36%
n=1e12 skew=-0.5 pseudo-math-random x 7,250,634 ops/sec ±0.30%
n=1e12 skew=-0.5 Math.random x 8,636,395 ops/sec ±0.46%

Fastest is:
n=1e6  skew=+1.0 Math.random

License

The code of this port is licensed MIT © 2019-present Vincent Weevers. The original code (Apache Commons Math v3.6.1) is licensed under the Apache License 2.0. For details please see the full LICENSE. The NOTICE of Apache Commons Math follows:

Apache Commons Math
Copyright 2001-2016 The Apache Software Foundation

This product includes software developed at
The Apache Software Foundation (http://www.apache.org/).

This product includes software developed for Orekit by
CS Systèmes d'Information (http://www.c-s.fr/)
Copyright 2010-2012 CS Systèmes d'Information

Package Sidebar

Install

npm i zipfian-integer

Weekly Downloads

98

Version

1.0.1

License

MIT

Unpacked Size

35.2 kB

Total Files

4

Last publish

Collaborators

  • vweevers