feature-scaler

1.0.0 • Public • Published

Build Status

feature-scaler

feature-scaler is a utility that transforms a list of arbitrary JavaScript objects into a normalized format suitable for feeding into machine learning algorithms. It can also decode encoded data back into its original format.

Motivation: I use Andrej Karpathy's excellent convnetjs library to experiment with neural networks in JavaScript and often have to preprocess my data before training a network. This utility makes it easy to encode data in a format usable by convnetjs.

"Why JavaScript?" is a fair question - Python's scikit-learn has most of the data preprocessing features you may need. I wrote this mainly because I wanted an easy way to use convnetjs without communicating across languages. If your data is big enough that convnetjs or the performance of the V8 engine in node.js is the limiting factor in your workflow, don't use JavaScript!

Field types currently supported: ints, floats, bools, and strings.

Check out tests/main.spec.js for a demo of this library in action.

In the following documentation, I'll use planetList as the example data set we're transforming. It looks like this:

const planetList = [
  { planet: 'mars', isGasGiant: false, value: 10 },
  { planet: 'saturn', isGasGiant: true, value: 20 },
  { planet: 'jupiter', isGasGiant: true, value: 30 }
]

The independent variables are planet and isGasGiant. The dependent variable is value.

encode(data, opts = { dataKeys, labelKeys })

  • data: list of raw data you need encoded. Assumptions: all entries in this list have the same structure as the first entry in the list. If the first element in data has a key called isGasGiant, and data[0].isGasGiant === true, isGasGiant should be a boolean for all objects in the list.
  • opts
  • opts.labelKeys - list of keys you are predicting values for (value).
  • opts.dataKeys optional - list of independent keys (planet, isGasGiant). If not provided, defaults to all keys minus opts.labelKeys.

Example usage:

const dataKeys = ['planet', 'isGasGiant'];
const labelKeys = ['value']
const encodedInfo = encode(planetList, { dataKeys: ['value']});
 
// encodedInfo.data
[ [ 1, 0, 0, 0, -1 ], [ 0, 1, 0, 1, 0 ], [ 0, 0, 1, 1, 1 ] ]
// Note: as is the norm with machine learning algorithms,
// "label" data is at the end of each row.
// encodedInfo.data[0][4] === -1; the scaled label value for Mars.
 
// encodedInfo.decoders - can be treated as a black box
[
  { key: 'planet', type: 'string', offset: 3, lookupTable: ['mars','saturn','jupiter'] },
  { key: 'isGasGiant', type: 'boolean' },
  { key: 'value', type: 'number', mean: 20, std: 10 }
]

Each entry in the "decoders" list is metadata from the original dataset. It contains information on how to transform an encoded row back into the original { key: value } pairs. Your code should not modify this list. The only thing you should do with it is feed it back into decode, described below.

Note: encodedInfo can safely be serialized to JSON and saved for later use with JSON.stringify(encodedInfo).

decode(encodedData, decoders)

  • encodedData - the data from encode output
  • decoders - the decoders from encode output

It returns the list of data in its original format.

decodeRow(encodedRow, decoders)

Similar to decode, but operates on a single row. e.g.

decodeRow(encodedData[0], decoders) === decode(encodedData, decoders)[0]

Technical details

The short version is this library encodes data in the following ways:

  • Number fields: (n - mean) / stddev
  • Boolean fields: n ? 1 : 0
  • String fields: one-hot encoding (see below).

One-hot encoding

Standardizing numbers and booleans is easy, but categorical string data is a little trickier. In the example above, transforming ['mars', 'jupiter', 'saturn'] into a single number value falsely implies* there is an ordering to the underlying value. Suppose you had a variable that represented the weather; there is no logical ordering to ['rain', 'sun', 'overcast']. If we naively had a sinlge numeric "weather" column where rain=0, sun=1, overcast=2, some machine learning algorithms would treat that field as "ordered".

Instead, we need to map these strings to a list of single-valued binary values. In the planets example, we see the following encodings:

  • mars == [0, 0, 1]
  • saturn == [0, 1, 0]
  • jupiter == [1, 0, 0]

We can feed this into an arbitrary machine learning algorithm without the possibility of it (incorrectly) inferring an ordering to our data.

* In our example, there is indeed an ordering to the planets! If the ordering is important, add a calculated field to the data before encoding. You could add a numberOfPlanetFromSun integer field to each record before encoding if the ordering of categorical data is important.

Further Reading

Todo

  • Add support for decoding a single value (currently only decoding a whole row is supported)
  • Add support for unrolling nested objects
  • Add support for missing data
  • Currently it standardizes numeric values; perhaps add support for scaling numeric values to [0, 1].

Contributions welcome! Please include unit tests, and ensure both npm run test and npm run lint pass without warning.

Versions

Current Tags

  • Version
    Downloads (Last 7 Days)
    • Tag
  • 1.0.0
    5
    • latest

Version History

  • Version
    Downloads (Last 7 Days)
    • Published
  • 1.0.0
    5

Package Sidebar

Install

npm i feature-scaler

Weekly Downloads

4

Version

1.0.0

License

ISC

Last publish

Collaborators

  • maxt