autoalign

1.0.3 • Public • Published

autoalign.js

align paired sequences automatically


NPM Travis Status


autoalign learns to align a list of paired but unaligned sequences by alternating between estimating a cost function based on the alignment statistics, and finding the optimal alignment with respect to that estimated cost function; in other words, it learns an edit distance from data, starting with a guess and iterating until convergence:

0. guess initial cost function
              🡓
1. align sequences using cost function         🡐                      
              🡓
2. estimate co-ocurrence statistics             🡑
              🡓
3. build cost function from statistics          🡑
              🡓
4. compute avg alignment cost                   🡑
              🡓
5. if no improvement go to 6, otherwise 1      🡒
              🡓
6. returned aligned pairs

Further details

Full API documentation

Install

npm i autoalign

to uninstall:

npm un autoalign

More install options

Usage

CLI

$ ./node_modules/.bin/autoalign -h
 
Usage: autoalign [options]
 
 
Options:
 
  -V, --version        output the version number
  -i --input <file>    input file; format: LEFT, RIGHT with white-space separated symbols
  -m --machine [file]  machine-readable output; format: LEFT, RIGHT, SCORE
  -u --human [file]    human-readable output, format: SCORE LEFT \n\r RIGHT
  -h, --help           output usage information

example using pronunciation dictionary Britfone as input:

$ ./node_modules/.bin/autoalign -i britfone.csv -m aligned.csv -u aligned.txt
 
[1:30:04 PM] read 15861 pairs from britfone.csv
[1:30:05 PM] avg.unnorm.cost: 2.2119283496010733
[1:30:06 PM] avg.unnorm.cost: 1.3996218560070568
[1:30:07 PM] avg.unnorm.cost: 1.383950761182861
[1:30:08 PM] avg.unnorm.cost: 1.3785348695282842
[1:30:09 PM] avg.unnorm.cost: 1.3784167544364208
[1:30:10 PM] avg.unnorm.cost: 1.3784167544364208
[1:30:11 PM] wrote 15861 alignments to aligned.csv
[1:30:11 PM] wrote 15861 alignments to aligned.txt

britfone.csv

... ... ...
S O L V I N G, s ɒ l v ɪ ŋ
S O M A, s əʊ m ə
S O M A L I A, s əʊ m ɑː l ɪə
S O M E, s ɐ m
... ... ...

aligned.csv

... .... ...
S O L V I N G, s ɒ l v ɪ · ŋ, 0.8138013477578335
S O M A, s əʊ m ə, 0.8147196394392467
S O M A L I A, s əʊ m ɑː l · ɪə, 0.7530844100226581
S O M E, s ɐ m ·, 0.7520632287232023
... .... ...

aligned.txt

... ... ...

0.81 S O L V I N G
     s ɒ l v ɪ · ŋ

0.81 W I  F E
     w aɪ f ·

0.81 P R O  T E I  N S
     p ɹ əʊ t · iː n z

0.81 F O  L D E R S
     f əʊ l d · ə z

... ... ...

API

> aa = require('autoalign')
 
{ GAP: [Getter],
  BY_SCORE: [Function],
  BY_ALPHA: [Function],
  SEP: [Getter],
  UNSEEN: [Getter],
  Stats: [Getter],
  vocab_sizes: [Getter],
  joint_stats_from_alignment: [Getter],
  smoothed_stats: [Getter],
  non_zero_smoothed_stats: [Getter],
  with_zero_smoothing: [Getter],
  alignment_cost_fn_estimator: [Getter],
  padding_cost_fn_estimator: [Getter],
  uniform_cost_fn_estimator: [Getter],
  pmi_cost_fn_from: [Getter],
  npmi: [Getter],
  pmi: [Getter],
  autoalign: [Function],
  csv_text_from_alignments: [Function],
  pretty_text_from_alignments: [Function] }
 
> pairs = [['SOLVING', 'sɒlvɪŋ'],
... ['SOMA', ['s','əʊ','m','ə']],
... ['SOMALIA', ['s','əʊ','ɑː','l','ɪə']],
... ['SOME', 'sɐm']]
 
[ [ 'SOLVING', 'sɒlvɪŋ' ],
  [ 'SOMA', [ 's', 'əʊ', 'm', 'ə' ] ],
  [ 'SOMALIA', [ 's', 'əʊ', 'ɑː', 'l', 'ɪə' ] ],
  [ 'SOME', 'sɐm' ] ]
 
> aa.autoalign(pairs, aa.uniform_cost_fn_estimator, aa.pmi_cost_fn_from)
 
[ [ [ 'S', 'O', 'L', 'V', 'I', 'N', 'G' ],
    [ 's', 'ɒ', 'l', 'v', '·', 'ɪ', 'ŋ' ],
    0.9007558120763637 ],
  [ [ 'S', 'O', 'M', 'A' ],
    [ 's', 'əʊ', 'm', 'ə' ],
    0.8754834816104571 ],
  [ [ 'S', 'O', 'M', 'A', 'L', 'I', 'A' ],
    [ 's', 'əʊ', '·', 'ɑː', 'l', '·', 'ɪə' ],
    0.8098602659026892 ],
  [ [ 'S', 'O', 'M', 'E' ],
    [ 's', 'ɐ', 'm', '·' ],
    0.8158098631621633 ] ]
 

Browser

<script src='./node_modules/autoalign/autoalign.js'></script>

autoalign will be available as a global

Full API documentation

Background

Autoalign uses the agreggated statistics of current alignments, computed with dynamic programming, in order to compute a better cost function for the next alignment pass. Given two aligned strings, it counts the co-ocurrences between the aligned symbols, adding those up over all alignments in the dataset. For instance these two alignments:

  P R O  T E I  N S
  p ɹ əʊ t · iː n z

  F O  L D E R S
  f əʊ l d · ə z

will result in co-occurence counts:

P:p   = 1
R:ɹ   = 1
O:əʊ  = 2
T:t   = 1
E:·   = 2
I:iː  = 1
N:n   = 1
S:z   = 2
F:f   = 1
L:l   = 1
D:d   = 1
R:ə   = 1

These statistics are first smoothed using good-turing estimation to avoid zero counts. Then they are used to estimate the strength of association between the symbols in the left (up) and right (down) alphabets (letters and phonemes in this example). This strength is measured with a normalised pointwise mutual information(NPMI) (x is left and y is right):

npmi

where pmi(x;y) is

pmi

and h(x, y) is the negative logarithm of the joint probability of the left and right symbols

(This use of pmi is analogous to log-odds ratio scoring in bioinformatics[1]).

NPMI's range is [-1, 1] and so it needs renormalising as the edit distance alignment costs must be non-negative. The end result is the final cost between the symbols.

Once the new cost function is estimated, the sequences are realigned and the process starts over again, stopping once the alignment cost for the whole dataset stops decreasing. Non-increase and convergence to a local minimum are guaranted as this is an instance of the Expectation Maximisation algorithm, similar to Viterbi Training and forced alignment in Speech Recognition.

autoalign is limited in three ways:

  • it only matches one symbol in the left alphabet to one symbol in the right alphabet alignment. This one-to-one alignment is inaccure for orthographic/phonetic sequences as phones and letters are often in a many-to-many correspondence. For tools that don't have this limitation see Related

  • it only uses the optimal alignment(s) to compute the counts, this misses the contribution of less likely aligments and makes the algorithm very sentitive to initial conditions[2][3]. For tools that use the full set of paths see Related

  • it can only compute monotonic alignments. This is not a problem for most linguistic data, with a notable exception being machine translation

[1] Biological Sequence Analysis, Durbin, Eddy, Krogh & Mitchison, 1998.

[2] Applying Many-to-Many Alignments and Hidden Markov Models to Letter-to-Phoneme Conversion, Jiampojamarn, Kondrak & Sherif, 2007.

[3] Learning String Distance, Ristad & Yianilos, 1998.

NPM tasks

test

npm test

Other install options

global npm install:

npm i -g autoalign
npm ln autoalign //ensures module is availabe to nodejs

then cli is, e.g. autoalign -i britfone.csv -m aligned.csv -u aligned.txt and npm un -g autoalign to uninstall

library-only git install:

git clone git@github.com:josellarena/autoalignjs.git
cd autoalignjs

then in node require('./autoalign)

and in the browser <script src='autoalign.js'></script>

Related

M2MAligner - A HMM-based aligner that can also do many-to-many alignments

AltFstAligner - Another HMM-based aligner, an intended replacement for M2MAligner

See Also

Britfone - a British English phonetic dictionary.

In order to autoalign Britfone, the word column needs to be processed such that letters are separated by whitespace

Changelog

see Changelog

License

MIT © Jose Llarena

Package Sidebar

Install

npm i autoalign

Weekly Downloads

0

Version

1.0.3

License

MIT

Last publish

Collaborators

  • josellarena