sirrobert-tokenize

Rationale

A module for tokenizing strings according to a particular pattern.

Namespace rationale

This module exists in the sirrobert- namespace so as not to clutter npm and to keep my related packages together. If there's enough interest, I can move this into the general namespace.

Installation

Local installation

npm install --save sirrobert-tokenize

Global installation

npm install --global sirrobert-tokenize

Token Definitions

This module contains only one function, called "tokenize". It takes a string and returns an array of tokens.

Tokens are defined by the module as one of the following (a combined matching sketch appears after this list):

  • Quoted Strings. Either single- or double-quoted strings. Double-quoted strings are defined as

    /("[^"\\]*(?:\\.[^"\\]*)*")/
    

    and single-quoted strings are defined similarly as

    /('[^'\\]*(?:\\.[^'\\]*)*')/
    

    Note that escaped quotes inside a quoted string are permitted, so both of these are valid tokens: "\"", '\''

  • Consecutive Non-Whitespace. Any run of non-whitespace characters bounded by whitespace (or the string boundaries). That means the following are examples of valid tokens of this kind: pickle, potato-pants, 249nf9W$GH(WGOSWJEUR. The definition of this kind of token is:

    /\S+/
    
  • Consecutive Whitespace. Any run of consecutive whitespace characters. The definition of this kind of token is:

    /\s+/
    
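Taken together, the three patterns above can be combined by alternation, trying quoted strings before bare non-whitespace runs. The following is only an illustrative sketch of that matching order, not necessarily how the module implements it internally:

// Illustrative sketch only: one regular expression alternating the three
// token patterns above. Quoted strings are tried before \S+ so a quoted
// token is not split at its internal whitespace.
const TOKEN = /"[^"\\]*(?:\\.[^"\\]*)*"|'[^'\\]*(?:\\.[^'\\]*)*'|\s+|\S+/g;

'say "hello there" twice'.match(TOKEN);
// => [ 'say', ' ', '"hello there"', ' ', 'twice' ]

Whether the whitespace matches are dropped, merged into a neighboring token, or kept as their own tokens is controlled by the whitespace option described under Options below.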

Usage

This module provides one function that takes a string and returns an array of tokens as defined above.

The usage pattern is: tokenize(string, [options-hash])

let tokenize = require("sirrobert-tokenize");

let str = "I am the \"Egg Man\" and the 'the \'Walrus\''";

tokenize(str);

/* ['I',
 *  'am',
 *  'the',
 *  '"Egg Man"',
 *  'and',
 *  'the \'Walrus\''
 * ]
 */

Options

There are two options available in the options hash (an example call follows this list):

  • whitespace How to handle whitespace in the string. Available values are:

    • ignore Disregard all whitespace. This is the default.
    • append Keep all whitespace, appending each run to the token immediately before it. Whitespace at the beginning of the string becomes its own token.
    • prepend Keep all whitespace, prepending each run to the token immediately after it. Whitespace at the end of the string becomes its own token.
    • tokenize Keep all whitespace. Each run of whitespace becomes its own token.
  • trimInput Whether to trim whitespace from the input string before processing. Defaults to true. The value is evaluated as a boolean.
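
For example, the options are passed as the second argument, and omitted keys fall back to their defaults. This call reuses str from the Usage example above:

tokenize(str, { whitespace: 'append', trimInput: false });
// Whitespace is kept and appended to the preceding token; the input is not trimmed first.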

Here are examples of various combinations of options using the string above.

{ whitespace: 'ignore', trimInput: true }
[ 'I',
  'am',
  'the',
  '"Egg Man"',
  'and',
  'the',
  '\'the \'',
  'Walrus\'\'' ]

{ whitespace: 'ignore', trimInput: false }
[ 'I',
  'am',
  'the',
  '"Egg Man"',
  'and',
  'the',
  '\'the \'',
  'Walrus\'\'' ]

{ whitespace: 'append', trimInput: true }
[ 'I ',
  'am ',
  'the ',
  '"Egg Man" ',
  'and ',
  'the ',
  '\'the \'',
  'Walrus\'\'' ]

{ whitespace: 'append', trimInput: false }
[ ' ',
  'I ',
  'am ',
  'the ',
  '"Egg Man" ',
  'and ',
  'the ',
  '\'the \'',
  'Walrus\'\'  ' ]

{ whitespace: 'prepend', trimInput: true }
[ 'I',
  ' am',
  ' the',
  ' "Egg Man"',
  ' and',
  ' the',
  ' \'the \'',
  'Walrus\'\'' ]

{ whitespace: 'prepend', trimInput: false }
[ ' I',
  ' am',
  ' the',
  ' "Egg Man"',
  ' and',
  ' the',
  ' \'the \'',
  'Walrus\'\'',
  '  ' ]

{ whitespace: 'tokenize', trimInput: true }
[ 'I',
  ' ',
  'am',
  ' ',
  'the',
  ' ',
  '"Egg Man"',
  ' ',
  'and',
  ' ',
  'the',
  ' ',
  '\'the \'',
  'Walrus\'\'' ]

{ whitespace: 'tokenize', trimInput: false }
[ ' ',
  'I',
  ' ',
  'am',
  ' ',
  'the',
  ' ',
  '"Egg Man"',
  ' ',
  'and',
  ' ',
  'the',
  ' ',
  '\'the \'',
  'Walrus\'\'',
  '  ' ]

LICENSE

MIT
