Rationale
A module for tokenizing strings according to a particular set of patterns.
Namespace rationale
This module exists in the sirrobert- namespace so as not to clutter npm and to keep my related packages together. If there's enough interest, I can move this into the general namespace.
Installation
Local installation
npm install --save sirrobert-tokenize
Global installation
npm install --global sirrobert-tokenize
Token Definitions
This module contains only one function, called "tokenize". It takes a string and returns an array of tokens.
Tokens are defined by the module as one of:
- Quoted Strings. Single- or double-quoted strings. Double-quoted strings are defined as
  /("[^"\\]*(?:\\.[^"\\]*)*")/
  and single-quoted strings are defined similarly as
  /('[^'\\]*(?:\\.[^'\\]*)*')/
  This means that escaped quotes inside a quoted string are permitted, so these would be valid tokens: "\"", '\''
- Consecutive Non-Whitespace. Any sequence of non-whitespace characters, bounded by whitespace or a string boundary. The following are examples of valid tokens of this kind: pickle, potato-pants, 249nf9W$GH(WGOSWJEUR. The definition of this kind of token is:
  /\S+/
- Consecutive Whitespace. Any run of consecutive whitespace. The specific definition of this kind of token is:
  /\s+/
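To see how the three definitions fit together, here is a minimal sketch that joins them into one alternation and scans a string with it. The names TOKEN_PATTERN and scanTokens are hypothetical, and the order of the alternatives is an assumption; this is not taken from the module's source.

```javascript
// Hypothetical combined pattern, NOT the module's own implementation.
// Alternatives are tried left to right at each position, so a quoted
// string is preferred over a plain run of non-whitespace.
const TOKEN_PATTERN =
  /"[^"\\]*(?:\\.[^"\\]*)*"|'[^'\\]*(?:\\.[^'\\]*)*'|\s+|\S+/g;

// Return every quoted string, whitespace run and non-whitespace run, in order.
function scanTokens(input) {
  // match() with the g flag returns null when nothing matches
  return input.match(TOKEN_PATTERN) || [];
}

// String.raw keeps the backslashes, so the text contains real \" escapes.
const sample = String.raw`say "a \"b\" c" now`;
console.log(scanTokens(sample));
```

Under this sketch, a quote that appears in the middle of a non-whitespace run (as in foo"bar) is absorbed by /\S+/ rather than starting a quoted token. Whether the module behaves the same way is not stated here.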
Usage
This module provides one function that takes a string and gives an array of tokens as defined above.
The usage pattern goes: tokenize(string, [options-hash])
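let tokenize = require("sirrobert-tokenize");
// Inside a double-quoted JavaScript string, \' is just ', so str holds:
//   I am the "Egg Man" and the 'the 'Walrus''
let str = "I am the \"Egg Man\" and the 'the \'Walrus\''";
tokenize(str);
/* ['I',
 *  'am',
 *  'the',
 *  '"Egg Man"',
 *  'and',
 *  'the',
 *  '\'the \'',
 *  'Walrus\'\''
 * ]
 */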
Options
There are two options available for the options hash:
- whitespace
  How to handle whitespace in the string. Available values are:
  - ignore
    Disregard all whitespace. This is the default.
  - append
    Keep all whitespace. Append it to the token it comes immediately after. Whitespace at the beginning of the string is its own token.
  - prepend
    Keep all whitespace. Prepend it to the token it comes immediately before. Whitespace at the end of the string is its own token.
  - tokenize
    Keep all whitespace. Each run of whitespace is its own token.
- trimInput
  Whether to trim whitespace from the input string before processing. Defaults to true. Any value is evaluated as a boolean.
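Here are examples of various combinations of options using the string above. The trimInput: false results include leading and trailing whitespace, so those runs appear to use the string padded with a space at each end.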
{ whitespace: 'ignore', trimInput: true }
[ 'I',
'am',
'the',
'"Egg Man"',
'and',
'the',
'\'the \'',
'Walrus\'\'' ]
{ whitespace: 'ignore', trimInput: false }
[ 'I',
'am',
'the',
'"Egg Man"',
'and',
'the',
'\'the \'',
'Walrus\'\'' ]
{ whitespace: 'append', trimInput: true }
[ 'I ',
'am ',
'the ',
'"Egg Man" ',
'and ',
'the ',
'\'the \'',
'Walrus\'\'' ]
{ whitespace: 'append', trimInput: false }
[ ' ',
'I ',
'am ',
'the ',
'"Egg Man" ',
'and ',
'the ',
'\'the \'',
'Walrus\'\' ' ]
{ whitespace: 'prepend', trimInput: true }
[ 'I',
' am',
' the',
' "Egg Man"',
' and',
' the',
' \'the \'',
'Walrus\'\'' ]
{ whitespace: 'prepend', trimInput: false }
[ ' I',
' am',
' the',
' "Egg Man"',
' and',
' the',
' \'the \'',
'Walrus\'\'',
' ' ]
{ whitespace: 'tokenize', trimInput: true }
[ 'I',
' ',
'am',
' ',
'the',
' ',
'"Egg Man"',
' ',
'and',
' ',
'the',
' ',
'\'the \'',
'Walrus\'\'' ]
{ whitespace: 'tokenize', trimInput: false }
[ ' ',
'I',
' ',
'am',
' ',
'the',
' ',
'"Egg Man"',
' ',
'and',
' ',
'the',
' ',
'\'the \'',
'Walrus\'\'',
' ' ]
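The four whitespace modes can be read as a post-processing step over a token list in which whitespace runs are still separate tokens. The sketch below illustrates that reading; applyWhitespaceMode is a hypothetical helper written for this example, not part of the module's API, and the module may be implemented differently.

```javascript
// Hypothetical helper, not the module's implementation: takes tokens in
// which whitespace runs are still separate and applies one whitespace mode.
function applyWhitespaceMode(tokens, mode = "ignore") {
  const isSpace = (t) => /^\s+$/.test(t);

  if (mode === "tokenize") return tokens.slice(); // keep every token as-is
  if (mode === "ignore") return tokens.filter((t) => !isSpace(t));

  const out = [];
  if (mode === "append") {
    for (const t of tokens) {
      // Whitespace joins the token before it; leading whitespace stands alone
      if (isSpace(t) && out.length > 0) out[out.length - 1] += t;
      else out.push(t);
    }
    return out;
  }
  if (mode === "prepend") {
    let pending = "";
    for (const t of tokens) {
      if (isSpace(t)) {
        pending += t; // hold whitespace for the next token
      } else {
        out.push(pending + t);
        pending = "";
      }
    }
    if (pending) out.push(pending); // trailing whitespace stands alone
    return out;
  }
  throw new Error(`unknown whitespace mode: ${mode}`);
}

// Token list copied from the { whitespace: 'tokenize', trimInput: false } example.
const sampleTokens = [
  " ", "I", " ", "am", " ", "the", " ", "\"Egg Man\"", " ",
  "and", " ", "the", " ", "'the '", "Walrus''", " ",
];
console.log(applyWhitespaceMode(sampleTokens, "append"));
console.log(applyWhitespaceMode(sampleTokens, "prepend"));
```

Applied to that token list, the append and prepend branches reproduce the arrays listed above for trimInput: false.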
LICENSE
MIT