list-matcher
TypeScript icon, indicating that this package has built-in type declarations

1.0.2 • Public • Published

list-matcher

A library for making non-backtracking regular expressions from a list of strings.

Synopsis

import { qw, regex } from 'list-matcher'

// qw is a convenience function for creating arrays of strings
qw("these are the times that try men's souls")
// => ['these', 'are', 'the', 'times', 'that', 'try', 'men\'s', 'souls']
qw('abcdefg', '')
// => ['a', 'b', 'c', 'd', 'e', 'f', 'g']

// minimal case
regex(['cat'])
// => /cat/

// case-insensitive
regex(['CAT'], { flags: 'i' })
// => /cat/i

// more interesting
regex(qw('cat camel coulomb dog e f g h i'))
// => /(?:c(?:a(?:mel|t)|oulomb)|dog|[e-i])/

// it finds simple character classes!
regex(qw('0123456789', ''))
// => /\d/

// it can express character classes as ranges
regex(qw('abcdefghijklmnopqrstuvwxyz_0123456789', ''))
// => /[\d_a-z]/

// let's make this case-insensitive!
regex(qw('abcdefghijklmnopqrstuvwxyz_0123456789', ''), { flags: 'i' })
// => /\w/i

// non-ascii!
regex(qw('süß bloß'))
// => /(?:blo|sü)ß/u

// find word boundaries for me and do something smart with whitespace
regex(['  cat  ', 'cat', 'cat or dog', '!@##$%^&*'], { bound: true, normalizeWhitespace: true })
// => /(?:!@##\$%\^&\*|\bcat(?:\s+or\s+dog)?\b)/

// what becomes of common whitespace characters?
regex(qw('\t\r\n\f',''))
// => /[\t\n\f\r]/

// what do you get with an empty array?
regex([])
// => /(?!)/

// how about empty strings?
regex([''])
// => /(?!)/

// word boundaries in non-ascii strings?
regex(qw('süß bloß'), { bound: true })
// => /(?<![\p{L}\p{N}_])(?:blo|sü)ß(?![\p{L}\p{N}_])/u

// what if I want to make an expression for splitting strings and keeping the boundary bits?
let splitter = regex(qw('cat dog bird'), { capture: true })
// => /((?:bird|cat|dog))/
'the cat saw a birddog'.split(splitter)
// => [ 'the ', 'cat', ' saw a ', 'bird', '', 'dog', '' ]

// and can we use this to combine patterns?
let telephoneNumbers = [
    'b###-####b',
    '(###) ###-####b',
    'b###-###-####b',
    '+#-###-###-####b',
    '+## #### ######b'
]
regex(telephoneNumbers, { substitutions: { '#': '\\d', 'b': '\\b' } })
// => /(?:\(\d{3}\) \d{3}-|\+\d(?:-\d{3}-\d{3}-|\d \d{4} \d\d)|\b\d{3}-(?:\d{3}-)?)\d{4}\b/

// NOTE: you don't need to use single-character identifiers for sub-patterns
regex(qw('a aa aaaaa'), { substitutions: { 'a': 'foo', 'aa': 'bar', 'aaa': 'baz' } })
// => /(?:bazbar|bar|foo)/

// sub-patterns keys are matched greedily, so 'aaaaa' becomes /bazbar/, not /foofoofoofoofoo/, etc.

API

regex

Takes an array of strings and, optionally, a ListMatcherOptions object.

Returns a compiled RegExp.

import { regex } from 'list-matcher'

regex(['cat', 'cats'])
// => /cats?/
regex(['cat', 'dog', 'mouse', 'camel', 'dromedary', 'muskrat'], { bound: true, flags: 'i' })
// => /\b(?:ca(?:mel|t)|d(?:og|romedary)|m(?:ouse|uskrat))\b/i

ListMatcherOptions

Various directives controlling how regex builds a regular expression.

bound: boolean

regex should discover and preserve word boundaries. See Caveats for problems that may arise under certain javascript engines.

regex(['cat', '@#$'], { bound: true })
// => /(?:@#\$|\bcat\b)/
regex(['süß', 'bloß'], { bound: true })
// => /(?<![\p{L}\p{N}_])(?:blo|sü)ß(?![\p{L}\p{N}_])/u

Note, word boundary discovery is suspended for substituted patterns. See below.

capture: boolean

regex should generate expression in parentheses.

regex(qw('cat cats'), { capture: true })
// => /(cats?)/

The chief use envisioned for this is creating an expression that can be given to String.prototype.split to split a string while preserving the expression split on.

normalizeWhitespace: boolean

regex should trim marginal whitespace from words and treat internal whitespace as equivalent to /\s+/.

regex([' cat ', 'cat   dog'], { normalizeWhitespace: true })
// => /cat(?:\s+dog)?/

flags: string

Regular expression flags, such as i, m, s, g, u, and y.

regex(['CAT'], { flags: 'iug' })
// => /cat/giu

The order of flags in the flag string is irrelevant. Characters other than those listed above will be ignored.

Note, two of the flags, i and u, may have other effects on the generated regular expression.

i causes all phrases to be downcased.

u changes the definition of word boundaries. Instead of matching on the margin between \w and \W or the ends of the string, unicode word boundary expression match on the margin between any word or number character or _ and the ends of the string. In order to do this the pattern must use a lookbehind, which may not be implemented in every javascript engine. If you don't set the bound option, this is irrelevant.

Another peculiarity of the u flag is that regex will turn it on even if not requested if it finds a non-ASCII character among the phrases it is given.

substitutions: Record<string, string>

The substitution map, if supplied, defines portions of input phrases that should be replaced with other expressions.

regex(['b###-####b'], { substitutions: { '#': '\\d', 'b': '\\b' } })
// => /\b\d{3}-\d{4}\b/

This mechanism obviously allows you to feed the output of regex back into regex:

let subjects = regex(['Anne', 'Bob', 'Carol'], { bound: true })
let verbs    = regex(['eats', 'throws', 'pats'], { bound: true })
let objects  = regex(['clams', 'rocks', 'chunks'], { bound: true })

let clauses = regex(['V S O', 'S V O', 'S O V'], { normalizeWhitespace: true, substitutions: { V: verbs.source, S: subjects.source, O: objects.source } })
// => /(?:\b(?:Anne|Bob|Carol)\b\s+(?:\b(?:c(?:hunk|lam)|rock)s\b\s+\b(?:eat|pat|throw)s\b|\b(?:eat|pat|throw)s\b\s+\b(?:c(?:hunk|lam)|rock)s\b)|\b(?:eat|pat|throw)s\b\s+\b(?:Anne|Bob|Carol)\b\s+\b(?:c(?:hunk|lam)|rock)s\b)/

This example makes salient the regretable lack of the (?i:...) expression in javascript regular expressions. We must have case insensitivity for the whole expression or none of it, alas.

Note, word boundary discovery cannot occur if the potential word boundary in question is a substitution. See bound above.

qw

Takes a string and, optionally, a splitter, either a string or a regular expression.

Returns an array of non-empty strings.

import { qw } from 'list-matcher'

qw('  some  words  with  spaces  between  them  ')
// => ['some', 'words', 'with', 'spaces', 'between', 'them']
qw('some text', '')
// => ['s', 'o', 'm', 'e', ' ', 't', 'e', 'x', 't']
qw('bird, bird, bird: bird is the word', /bird/)
// => [', ', ', ', ': ', ' is the word']
qw('foo bar baz', /([aeiou])/)
// => ['f', 'o', 'o', ' b', 'a', 'r b', 'a', 'z']

This is an exported function of list-matcher chiefly because it made documenting the library easier and clearer. The implementation is quite simple:

export function qw(s: string, splitter: string | RegExp = /\s+/): string[] {
  return s.split(splitter).filter((p) => p)
}

The name and behavior of qw are inspired by the Perl qw array literal expression, which is like Ruby's %w, etc.

Advantages

  1. Compiling regular expressions this way produces more readable code. It's clear what regex(['cat', 'carp', 'camel']) does. It's less clear what /ca(?:mel|rp|t)/ does.
  2. You can use regex programmatically, building matchers for dynamic lists of great length.
  3. You don't need to worry about escaping metacharacters.

Embedding

If you wish to embed one regular expression in another, this won't work:

new RegExp(`other stuff with ${regex(list)} a regex in the middle`)

The interpolated regular expression will be delimited by forward slashes and may potentially include flag characters.

You can, however, do this:

new RegExp(`other stuff with ${regex(list).source} a regex in the middle`)

Efficiency

The regular expression generated by list-matcher are non-backtracking, so they should be optimally efficient. In some javascript engines this will actually give you a performance boost. Safari is the only engine where I've seen this. The others compile naive regular expressions like /cat|camel|carp/ into non-backtracking automata such as you would get from /ca(?:mel|rp|t)/ without any optimization.

The construction of regular expressions from lists is fairly swift, but still, as with the ordinary compilation of regular expressions, you should avoid doing this in a tight loop.

Caveats

Though I have taken some care to handle non-ASCII characters, I have not yet thoroughly tested higher number codepoints. In particular, I have not tested the matching of combined unicode characters.

list-matcher relies heavily on String.prototype.fromCharCode and String.prototype.codePointAt and other functions in the String prototype. Not all of these are implemented in all browsers. list-matcher therefore includes polyfills that supply these if they are absent. All polyfills are borrowed from the Mozilla Developer Network.

There are regular expression features used in finding and defining unicode word boundaries. These are not present in all javascript engines either, and they cannot be polyfilled.

If you either avoid unicode, avoid the bound option, or avoid the browers that have not implemented these features, you will be okay.

If you use substitutions, you are responsible for the patterns subsituted in; list-matcher won't check them.

Acknowledgements

As always I thank my co-workers at Green River and my family for indulging me while I tilt at windmills.

Package Sidebar

Install

npm i list-matcher

Weekly Downloads

0

Version

1.0.2

License

MIT

Unpacked Size

74.8 kB

Total Files

13

Last publish

Collaborators

  • dfhoughton