Negligent Pachinko Machine

    string-comparison
    TypeScript icon, indicating that this package has built-in type declarations

    1.1.0 • Public • Published

    string-comparison

    npm bundle size npm GitHub stars GitHub license

    JavaScript implementation of tdebatty/java-string-similarity

    A library implementing different string similarity, distance and sortMatch measures. A dozen of algorithms (including Levenshtein edit distance and sibblings, Longest Common Subsequence, cosine similarity etc.) are currently implemented. Check the summary table below for the complete list...

    Download & Usage

    download

    npm install string-comparison --save
    yarn add string-comparison

    usage

    let stringComparison = require('string-comparison')
    
    const Thanos = 'healed'
    const Rival = 'sealed'
    const Avengers = ['edward', 'sealed', 'theatre']
    
    // use by cosine
    let cos = stringComparison.cosine
    
    console.log(cos.similarity(Thanos, Rival))
    console.log(cos.distance(Thanos, Rival))
    console.log(cos.sortMatch(Thanos, Avengers))

    OverView

    The main characteristics of each implemented algorithm are presented below. The "cost" column gives an estimation of the computational cost to compute the similarity between two strings of length m and n respectively.

    Measure(s) Normalized? Metric? Type Cost Typical usage
    Jaccard index similarity
    distance
    sortMatch
    Yes Yes Set O(m+n)
    Cosine similarity similarity
    distance
    sortMatch
    Yes No Profile O(m+n)
    Sorensen-Dice coefficient similarity
    distance
    sortMatch
    Yes No Set O(m+n)
    Levenshtein similarity
    distance
    sortMatch
    No Yes O(m*n)
    Jaro-Winkler similarity distance
    sortMatch
    Yes No O(m*n) typo correction

    Normalized, metric, similarity and distance

    Although the topic might seem simple, a lot of different algorithms exist to measure text similarity or distance. Therefore the library defines some interfaces to categorize them.

    (Normalized) similarity and distance

    • StringSimilarity : Implementing algorithms define a similarity between strings (0 means strings are completely different).
    • NormalizedStringSimilarity : Implementing algorithms define a similarity between 0.0 and 1.0, like Jaro-Winkler for example.
    • StringDistance : Implementing algorithms define a distance between strings (0 means strings are identical), like Levenshtein for example. The maximum distance value depends on the algorithm.
    • NormalizedStringDistance : This interface extends StringDistance. For implementing classes, the computed distance value is between 0.0 and 1.0. NormalizedLevenshtein is an example of NormalizedStringDistance.

    Levenshtein

    The Levenshtein distance between two words is the minimum number of single-character edits (insertions, deletions or substitutions) required to change one word into the other.

    It is a metric string distance. This implementation uses dynamic programming (Wagner–Fischer algorithm), with only 2 rows of data. The space requirement is thus O(m) and the algorithm runs in O(m.n).

    const Thanos = 'healed'
    const Rival = 'sealed'
    const Avengers = ['edward', 'sealed', 'theatre']
    let ls = Similarity.levenshtein
    
    console.log(ls.similarity(Thanos, Rival))
    console.log(ls.distance(Thanos, Rival))
    console.log(ls.sortMatch(Thanos, Avengers))
    
    // output
    0.8333333333333334
    1
    [
      { member: 'edward', index: 0, rating: 0.16666666666666663 },
      { member: 'theatre', index: 2, rating: 0.4285714285714286 },
      { member: 'sealed', index: 1, rating: 0.8333333333333334 }
    ]

    Longest Common Subsequence

    The longest common subsequence (LCS) problem consists in finding the longest subsequence common to two (or more) sequences. It differs from problems of finding common substrings: unlike substrings, subsequences are not required to occupy consecutive positions within the original sequences.

    It is used by the diff utility, by Git for reconciling multiple changes, etc.

    The LCS distance between strings X (of length n) and Y (of length m) is n + m - 2 |LCS(X, Y)| min = 0 max = n + m

    LCS distance is equivalent to Levenshtein distance when only insertion and deletion is allowed (no substitution), or when the cost of the substitution is the double of the cost of an insertion or deletion.

    This class implements the dynamic programming approach, which has a space requirement O(m.n), and computation cost O(m.n).

    In "Length of Maximal Common Subsequences", K.S. Larsen proposed an algorithm that computes the length of LCS in time O(log(m).log(n)). But the algorithm has a memory requirement O(m.n²) and was thus not implemented here.

    const Thanos = 'healed'
    const Rival = 'sealed'
    const Avengers = ['edward', 'sealed', 'theatre']
    let lcs = Similarity.lcs
    
    console.log(lcs.similarity(Thanos, Rival))
    console.log(lcs.distance(Thanos, Rival))
    console.log(lcs.sortMatch(Thanos, Avengers))
    
    // output
    0.8333333333333334
    2
    [
      { member: 'edward', index: 0, rating: 0.5 },
      { member: 'theatre', index: 2, rating: 0.6153846153846154 },
      { member: 'sealed', index: 1, rating: 0.8333333333333334 }
    ]

    Metric Longest Common Subsequence

    Distance metric based on Longest Common Subsequence, from the notes "An LCS-based string metric" by Daniel Bakkelund. http://heim.ifi.uio.no/~danielry/StringMetric.pdf

    The distance is computed as 1 - |LCS(s1, s2)| / max(|s1|, |s2|)

    const Thanos = 'healed'
    const Rival = 'sealed'
    const Avengers = ['edward', 'sealed', 'theatre']
    let mlcs = Similarity.mlcs
    
    console.log(mlcs.similarity(Thanos, Rival))
    console.log(mlcs.distance(Thanos, Rival))
    console.log(mlcs.sortMatch(Thanos, Avengers))
    
    // output
    0.8333333333333334
    0.16666666666666663
    [
      { member: 'edward', index: 0, rating: 0.5 },
      { member: 'theatre', index: 2, rating: 0.5714285714285714 },
      { member: 'sealed', index: 1, rating: 0.8333333333333334 }
    ]

    Cosine similarity

    Like Q-Gram distance, the input strings are first converted into sets of n-grams (sequences of n characters, also called k-shingles), but this time the cardinality of each n-gram is not taken into account. Each input string is simply a set of n-grams. The Jaccard index is then computed as |V1 inter V2| / |V1 union V2|.

    Distance is computed as 1 - similarity. Jaccard index is a metric distance.

    Sorensen-Dice coefficient

    Similar to Jaccard index, but this time the similarity is computed as 2 * |V1 inter V2| / (|V1| + |V2|).

    Distance is computed as 1 - similarity.

    API

    • similarity.
    • distance.
    • sortMatch

    similarity

    Implementing algorithms define a similarity between strings

    params

    1. thanos [String]
    2. rival [String]

    return

    Return a similarity between 0.0 and 1.0

    distance

    Implementing algorithms define a distance between strings (0 means strings are identical)

    params

    1. thanos [String]
    2. rival [String]

    return

    Return a number

    sortMatch

    params

    1. thanos [String]
    2. avengers [...String]

    return

    Return an array of objects. ex:

    [
      { member: 'edward', rating: 0.16666666666666663 },
      { member: 'theatre', rating: 0.4285714285714286 },
      { member: 'mailed', rating: 0.5 },
      { member: 'sealed', rating: 0.8333333333333334 }
    ]

    CHANGLOG

    CHANGLOG

    MIT

    MIT

    Install

    npm i string-comparison

    DownloadsWeekly Downloads

    3,286

    Version

    1.1.0

    License

    MIT

    Unpacked Size

    23.5 kB

    Total Files

    8

    Last publish

    Collaborators

    • simida