# Multi-Lingual Word Alignment Prediction (Word MAP)

## Installation

```
npm i word-map
```

## Introduction
Word alignment prediction is the process of associating (mapping) words from some primary text with corresponding words in a secondary text. This tool uses statistical algorithms to determine which words or phrases in two texts are equivalent in meaning.
Alignments provide many valuable benefits to translators, including:

- Ensuring that all terms and phrases in the primary text have a proper translation in the secondary text.
- Providing in-context vocabulary suggestions to the translator.
- Helping prevent inconsistencies in the translation.
## Terms

- Primary Text: The original biblical texts (traditionally referred to as "source text").
- Primary Language: The language used in a primary text.
- Secondary Text: The translation of a primary text.
- Secondary Language: The language used in a secondary text. See also gateway language.
- Gateway (Secondary) Languages: Those languages that compose the minimum set of trade languages in the world. See also secondary language.
- Ternary Text: The translation of a secondary text.
- Ternary (Minor) Language: A non-trade language spoken by a small group of people, i.e. a language that is not a gateway language. Also, the language used in a ternary text.
- Token: A unit of text within a sentence, such as a word or punctuation mark.
- n-gram (word or phrase): A contiguous sequence of n tokens from a given sample of text. An n-gram containing a single token is referred to as a "unigram", two tokens as a "bigram", etc. For example: "hello" is a unigram, while "hello world" is a bigram.
- Unaligned Sentence Pair: A sentence in two languages that needs to be aligned, e.g. a sentence from a primary text and a secondary text.
- Alignment: Two individual n-grams that have been matched from two texts, e.g. from a primary text and a secondary text.
- Saved Alignment: An alignment that has been approved/corrected by the user.
- Engine: Contains an index of every permutation of possible n-gram alignments, and an index of saved alignments.
- Corpus: The input dataset, which is the primary and secondary text given as a list of unaligned sentence pairs. This is used in training the engine. Note: this is not input directly provided by the user.
- Tokenization: Separating a sentence into individual words and punctuation.
- Normalization: A text might use several different UTF-8 character representations for the same visual character. The process of normalization reduces each visual character to a single UTF-8 representation. A text using a single consistent representation is considered normalized.
Note: it is important to understand that our definition of n-grams is "contiguous". It is possible and even beneficial to support non-contiguous n-grams, however this greatly increases the resources required by the system.
## Use Cases

- Aligning a primary text with a secondary text, e.g. when generating word maps for gateway languages.
- Aligning a secondary text with a ternary text.
- Aligning a primary text with a ternary text (using the secondary text as a proxy).
## The Need

Existing tools require large data sets and complex runtime environments, and are usually limited to running on a server.
We need a tool that:
- runs on the client with minimal configuration.
- works with existing web browser technology.
- integrates with translationCore and related tools.
- works without an Internet connection.
- does not have a minimum corpus size.
- requires minimal system resources.
## Outline

### Corpus Preprocessing

- Corpus must be prepared as a list of unaligned tokenized sentence pairs, and optionally normalized.

### Training/Encoding

- Receive corpus
- Generate n-grams for corpus
  - Generate permutations
  - KeyStore and tally occurrences
- Receive saved alignments (as a list of known permutations)
  - KeyStore and tally occurrences

### Prediction/Decoding

- Receive unaligned sentence pair
- Generate n-grams for the unaligned sentence pair
- Generate permutations for the n-grams between languages
- Select the subset of data that is relevant
  - relevant to words in the provided unaligned sentence pair
  - from both the corpus and saved alignments indices
- Statistical algorithms
  - Static scoring
  - Scoring
  - Weighted average of all scores
- Alignment prediction
  - Selection of best alignments via process of elimination
    - Pick the best
    - Eliminate non-usable conflicts
    - Penalize usable conflicts
    - Repeat until all words are covered
  - Order the selected alignments to match the primary sentence word order
## Requirements

### Tool Requirements
- Learn and adapt to regularly changing input without relying on previously stored engines.
- Must not store trained engines.
- Easy to add new lines of corpus.
- Easy to manually add saved alignments.
- Be fast enough to align a single sentence in real time.
- Differentiate multiple occurrences of the same word within a sentence.
TODO: Support metadata like lemmas, stems, Strong's numbers, and parts of speech in all languages. Also support contextually saved alignments, e.g. alignments grouped by verse. In order to support metadata the input will have to be an object. String input should also be accepted, but will result in a "dumb" object. We need to define the object structure and which features of the tool will be enabled for each key in the object.
### Input Prerequisites

- The corpus must be tokenized.
- The unaligned sentence pairs must be tokenized.
- The corpus and unaligned sentence pairs must be in the same primary and secondary languages.
- Input must be in utf8.
- Input characters should be normalized for optimum results.
- Input tokens should be objects when including metadata.
## Overview of Operation
The following is a non-technical description of how this tool could be used.
- The tool is initialized with some corpus and previously saved alignments.
- The tool trains a new engine using the provided corpus.
- The user gives the tool an unaligned sentence pair.
- The tool generates and returns a list of possible alignments for the sentence pair provided by the user.
- The user chooses the correct alignments to use in their work.
- The alignments chosen by the user are given back to the tool as saved alignments to increase the accuracy of future predictions.
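
In code, this flow might look roughly like the following. This is a hypothetical sketch; the class and method names are illustrative assumptions, not the actual word-map API.

```ts
// Hypothetical usage sketch; method names are assumptions, not the real API.
import WordMap from "word-map";

const map = new WordMap();

// 1-2. initialize with corpus (and any previously saved alignments); training
// happens as the corpus is indexed
map.appendCorpus([
  ["Βίβλος γενέσεως Ἰησοῦ Χριστοῦ", "The book of the genealogy of Jesus Christ"],
]);

// 3-4. request alignment predictions for an unaligned sentence pair
const suggestions = map.predict(
  "Βίβλος γενέσεως Ἰησοῦ Χριστοῦ",
  "The book of the genealogy of Jesus Christ"
);

// 5-6. alignments approved by the user are fed back to the tool
map.appendSavedAlignments(suggestions);
```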
## Engine Training

An engine is composed of two indices: the corpus index and the saved alignments index.

### Corpus KeyStore
This index is generated by iterating over the corpus. For each unaligned sentence pair in the corpus:
- Filter out punctuation from the corpus.
- Generate n-grams for each sentence (n-grams are often limited to lengths of 1 to 3).
- Generate permutations of all possible combinations of n-grams between primary and secondary sentences so that we can:
- Tally the occurrences of each permutation across the entire corpus.
NOTE: while generating permutations, include a match against 0-length n-grams. This accounts for the case where an n-gram may not have a valid match in the opposing language.
Code samples:
```js
// Note: the token strings would be objects by the time they get here.
// Strings are used for simplicity in conveying concepts.
corpus = [
  // e.g. [["primary", "language", "sentence", "."], ["secondary", ...]]
  [["a", "b", "c", "."], ["d", "e", "f", "|"]],
  ...
];

// n-grams for the first sentence in the primary text
ngrams00 = [["a"], ["b"], ["c"], ["a", "b"], ["b", "c"], ["a", "b", "c"]];

// n-grams for the first sentence in the secondary text
ngrams01 = [["d"], ["e"], ["f"], ["d", "e"], ["e", "f"], ["d", "e", "f"]];

// temporary calculations
permutations = [
  // permutations of the first sentence
  // [<primary n-gram>, <secondary n-gram>], ...
  [["a"], ["d"]], [["a"], ["e"]], [["a"], ["f"]],
  [["b"], ["d"]], [["b"], ["e"]], [["b"], ["f"]],
  [["c"], ["d"]], [["c"], ["e"]], [["c"], ["f"]],
  ...
  [["b", "c"], ["d"]], [["b", "c"], ["e"]], [["b", "c"], ["f"]],
  [["b", "c"], ["d", "e"]], [["b", "c"], ["e", "f"]],
  [["b", "c"], ["d", "e", "f"]],
  ...
  // permutations of other sentences
  ...
];

// keyed by n-grams in the primary text.
// These are the number of times an n-gram occurs in the primary text.
primaryNgramFrequencyIndex = {
  "a": 5,
  "b": 2,
  ...
  "a b": 1,
  ...
};

// keyed by n-grams in the secondary text.
// These are the number of times an n-gram occurs in the secondary text.
secondaryNgramFrequencyIndex = {
  "d": 5,
  "e": 2,
  ...
  "d e": 1,
  ...
};

// keyed by n-grams in the primary text.
// These are the frequencies of possible alignments within the primary text.
primaryAlignmentFrequencyIndex = {
  "a": {"d": 1, "e": 3, "f": 12, ...},
  "b": {"d": 4, "e": 1, "f": 1, ...},
  "b c": {"d": 1, "e": 1, "f": 1, "d e": 1, "e f": 1, "d e f": 1},
  ...
};

// keyed by n-grams in the secondary text.
// These are the frequencies of possible alignments within the secondary text.
secondaryAlignmentFrequencyIndex = {
  "d": {"a": 1, "b": 2, "c": 12, ...},
  "e": {"a": 4, "b": 1, "c": 1, ...},
  ...
};
```
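
To make the n-gram generation step concrete, here is a minimal sketch of contiguous n-gram generation with the typical length limit of 1 to 3. `generateNgrams` is an illustrative helper, not part of the library's API.

```ts
// Generate all contiguous n-grams of length 1..maxLength from a token list.
function generateNgrams(tokens: string[], maxLength = 3): string[][] {
  const ngrams: string[][] = [];
  for (let n = 1; n <= maxLength; n++) {
    for (let i = 0; i + n <= tokens.length; i++) {
      ngrams.push(tokens.slice(i, i + n));
    }
  }
  return ngrams;
}

// generateNgrams(["a", "b", "c"]) yields
// [["a"], ["b"], ["c"], ["a","b"], ["b","c"], ["a","b","c"]]
```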
### Saved Alignments KeyStore

This index is generated by iterating over all saved alignments. For each saved alignment, it tallies occurrences. Since these alignments have already been verified by the user, there is no need to generate n-grams or permutations.
See the code examples for the Corpus KeyStore above.
### Performance Notes
Most other engines require millions of lines of corpus for training, and that amount of data takes an enormous amount of time to process. These engines also run heavy statistical algorithms over every permutation during training, even though only a small percentage of the data is used when decoding the alignments for a single sentence. The result is a lot of wasted time and resources.

In our engine implementation the corpus is first indexed and stored without running statistical algorithms. This allows us to quickly and dynamically add new corpus to the index without running expensive algorithms.

When a sentence is decoded we only use the relevant data from the index. Since we operate on a small subset of the data, our statistical algorithms run dramatically faster.
## Prediction Algorithms

The alignment prediction begins when the user provides an unaligned sentence pair.

### n-gram Document Frequency

Note: This algorithm uses concepts similar to those found in tf-idf.

The n-gram document frequency algorithm is the foundation for most of the other algorithms in this tool.
The algorithm proceeds as follows:
- Generate n-grams for each sentence in the unaligned sentence pair.
- Generate permutations of all possible alignments of n-grams between the sentences in the pair.
- Filter the corpus/saved alignment indices down to just those possible alignments within the unaligned sentence pair.

Perform the following calculations on the filtered and unfiltered corpus/saved alignment indices:

- Calculate the frequency of each filtered primary n-gram over the index.
- Calculate the ratio of alignment frequency vs primary n-gram frequency.
- Calculate the frequency of each filtered secondary n-gram over the index.
- Calculate the ratio of alignment frequency vs secondary n-gram frequency.

Note: It is important to compare the filtered and unfiltered frequency ratios. This is done in other algorithms.
Pseudo Code Samples:
```js
// Note: illustrative pseudo code. Parts of the original listing were lost,
// so the loop bodies below are assumptions.

// filter the corpus and add corpus frequency sums and ratios
primaryNgrams = [...];
secondaryNgrams = [...];
primaryIndex = {};
secondaryIndex = {};
primaryNgrams.forEach(/* copy relevant corpus index entries into the filtered indices */);

// add filtered frequency sums and ratios
Object.keys(primaryIndex).forEach(/* sum frequencies and compute ratios */);
Object.keys(secondaryIndex).forEach(/* likewise for the secondary index */);
```
### Phrase Plausibility

N-grams are essentially "dumb" phrases, since they combine any contiguous tokens. A major problem is identifying whether an n-gram is actually a phrase. Checking how commonly an n-gram is used across the corpus helps determine how likely it is to be a phrase.

Plausibility is determined by first assuming that unigrams are phrases. Larger n-grams use the calculated commonality.

NOTE: we need a way to allow 0-length n-grams in alignments, because a word in one language may not exist in the other.

#### Commonality

Commonality indicates the likelihood that both n-grams are phrases. A high score indicates it is plausible that, as phrases, they could be equivalent.

where x = primary n-gram corpusFrequency
and y = secondary n-gram corpusFrequency

commonality = min( (1 - (1/x)), (1 - (1/y)) )
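
A minimal sketch of the commonality formula above; the parameter names are assumptions mirroring the formula variables:

```ts
// Commonality of an alignment: how likely both n-grams are real phrases,
// based on how often each occurs in its corpus.
function commonality(primaryCorpusFreq: number, secondaryCorpusFreq: number): number {
  return Math.min(1 - 1 / primaryCorpusFreq, 1 - 1 / secondaryCorpusFreq);
}
```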
### Uniqueness

Some words are used uniquely in the current unaligned sentence pair, which results in low or unreliable confidence scores in most algorithms.

Uniqueness is the inverse of commonality. If an n-gram is used only once or very rarely in the primary text, it is likely that the equivalent n-gram in the secondary text is also used very rarely.

#### Primary n-gram uniqueness

where x = primary n-gram filteredFrequency
and y = primary n-gram corpusFrequency

primary uniqueness = x / y

#### Secondary n-gram uniqueness

where x = secondary n-gram filteredFrequency
and y = secondary n-gram corpusFrequency

secondary uniqueness = x / y
#### Uniqueness Affinity

Once we have the uniqueness of the primary and secondary n-grams we can calculate the uniqueness of the alignment.

delta = abs(primary uniqueness - secondary uniqueness)
uniqueness = 1 - delta
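
A sketch of the uniqueness affinity calculation; the parameter names are assumptions mirroring the formula variables:

```ts
// Uniqueness affinity: how similarly rare the two n-grams are.
function uniquenessAffinity(
  primaryFilteredFreq: number, primaryCorpusFreq: number,
  secondaryFilteredFreq: number, secondaryCorpusFreq: number
): number {
  const primaryUniqueness = primaryFilteredFreq / primaryCorpusFreq;
  const secondaryUniqueness = secondaryFilteredFreq / secondaryCorpusFreq;
  return 1 - Math.abs(primaryUniqueness - secondaryUniqueness);
}
```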
### N-gram Length

In other algorithms n-grams are scored equally regardless of length. However, shorter n-grams are more prevalent than longer n-grams, which typically results in shorter n-grams overwhelming the output.

The weight of an alignment increases proportionally to its length and its relative sentence coverage in the primary and secondary text.

#### Length Ratio

Get the ratio of n-gram length to the sentence token length for each language.

##### Primary n-gram length ratio

where x = primary n-gram length
and y = primary sentence token length

primary length ratio = x / y

##### Secondary n-gram length ratio

where x = secondary n-gram length
and y = secondary sentence token length

secondary length ratio = x / y

#### Length Affinity

Once we have the length ratios of the two n-grams we can calculate the n-gram weight for the alignment.

delta = abs(primary length ratio - secondary length ratio)
weight = pow(1 - delta, 5)
NOTE: the numbers in the initial delta calculation were too flat so we added a power of 5 to improve the curve.
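
A sketch of the length affinity weight, including the power-of-5 curve described in the note above; the parameter names are assumptions mirroring the formula variables:

```ts
// Length affinity: how similar the relative sentence coverage of the two
// n-grams is, sharpened with a power of 5.
function lengthAffinity(
  primaryNgramLen: number, primarySentenceLen: number,
  secondaryNgramLen: number, secondarySentenceLen: number
): number {
  const primaryRatio = primaryNgramLen / primarySentenceLen;
  const secondaryRatio = secondaryNgramLen / secondarySentenceLen;
  const delta = Math.abs(primaryRatio - secondaryRatio);
  return Math.pow(1 - delta, 5);
}
```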
### Character Length

Note: This algorithm is very similar to the n-gram length algorithm, although we ended up using a completely different formula.

Phrases commonly have similar character lengths across multiple languages. Other algorithms do not account for this similarity, so these highly probable alignments are overlooked.

The character length weight is proportional to the character lengths of the primary n-gram and the secondary n-gram.

NOTE: it's important to look at the ratio of the n-gram's character length vs the sentence's character length. Additionally, we may want a new algorithm that compares this ratio.

TODO: give this algorithm some TLC. Is there a better formula for calculating this?
where x = primary n-gram character length
and y = secondary n-gram character length
delta = abs(x - y)
largest = max(x, y)
weight = (largest - delta) / largest
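
A sketch of the character length weight above; the parameter names are assumptions mirroring the formula variables:

```ts
// Character length weight: 1.0 when both n-grams have the same character
// length, falling toward 0 as the lengths diverge.
function characterLengthWeight(primaryChars: number, secondaryChars: number): number {
  const delta = Math.abs(primaryChars - secondaryChars);
  const largest = Math.max(primaryChars, secondaryChars);
  return (largest - delta) / largest;
}
```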
### Alignment Occurrences

A commonly seen pattern in translation is that word repetition in the primary text is often mirrored in the secondary text. Other algorithms are unable to account for this pattern, so these highly probable alignments are overlooked.

The alignment occurrences weight is inversely proportional to the difference between the n-gram frequency in the primary sentence and the n-gram frequency in the secondary sentence.

Example: a 1 to 1 n-gram alignment gets the highest score. As the occurrence counts deviate, the score decreases.
where x = primary n-gram occurrences count in the unaligned sentence
and y = secondary n-gram occurrences count in the unaligned sentence
delta = abs(x - y)
weight = 1 / ( delta + 1)
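
A sketch of the occurrences weight above; the parameter names are assumptions mirroring the formula variables:

```ts
// Occurrences weight: 1.0 when both n-grams repeat the same number of times
// within the unaligned sentence pair, decaying as the counts diverge.
function occurrencesWeight(primaryOccurrences: number, secondaryOccurrences: number): number {
  return 1 / (Math.abs(primaryOccurrences - secondaryOccurrences) + 1);
}
```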
### Alignment Position

Note: this feature is currently for literal translations only. It could be enhanced to support dynamic translations in the future.

Other algorithms do not identify the relative positions of n-grams in their sentences. However, in literal translations this positioning is usually proportional, so these highly probable alignments are overlooked.

We want to give extra preference to those alignments whose n-grams are in relatively proportional positions within their sentences. The alignment position weight is proportional to the difference between the n-gram positions in the primary and secondary sentences.
where x = primary n-gram position (relative to sentence length, 0 to 1)
and y = secondary n-gram position (relative to sentence length, 0 to 1)
delta = abs(x - y)
weight = 1 - delta
NOTE: This algorithm accounts for single occurrences within a sentence. A new algorithm would need to be created to account for multiple occurrences. We cannot combine the two algorithms, because it would skew the weight toward either single occurrences or multiple occurrences.
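
A sketch of the position weight under the single-occurrence assumption noted above; positions are taken as relative (0 to 1) offsets within each sentence:

```ts
// Position weight: 1.0 when both n-grams sit at the same relative position
// within their sentences.
function positionWeight(primaryRelativePos: number, secondaryRelativePos: number): number {
  return 1 - Math.abs(primaryRelativePos - secondaryRelativePos);
}
```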
### Frequency Ratios

TODO: we need to document this!!!

TODO: we could move these calculations into the n-gram document frequency algorithm. We also need to document these better.
where x = primary filteredFrequencyRatio
and y = secondary filteredFrequencyRatio
affr = average filtered frequency ratio = (x + y) / 2
where a = primary corpusFrequencyRatio
and b = secondary corpusFrequencyRatio
acfr = average corpus frequency ratio = (a + b) / 2
where c = alignmentFrequency / primaryNgramFrequency
and d = alignmentFrequency / secondaryNgramFrequency
anfr = average ngram frequency ratio = (c + d) / 2
weight = (affr + acfr + anfr) / 3
NOTE: this is a simplified average. A weighted average would be ideal.
e.g.
(affr * weight1 + acfr * weight2 + anfr * weight3) / (weight1 + weight2 + weight3)
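
A sketch of the simple (unweighted) average described above; the parameter names are assumptions mirroring the formula variables:

```ts
// Frequency ratio weight: a simple average of the filtered, corpus, and
// n-gram frequency ratios (a weighted average would be ideal, per the note).
function frequencyRatioWeight(
  primaryFilteredRatio: number,   // x
  secondaryFilteredRatio: number, // y
  primaryCorpusRatio: number,     // a
  secondaryCorpusRatio: number,   // b
  alignmentFrequency: number,
  primaryNgramFrequency: number,
  secondaryNgramFrequency: number
): number {
  const affr = (primaryFilteredRatio + secondaryFilteredRatio) / 2;
  const acfr = (primaryCorpusRatio + secondaryCorpusRatio) / 2;
  const anfr =
    (alignmentFrequency / primaryNgramFrequency +
      alignmentFrequency / secondaryNgramFrequency) / 2;
  return (affr + acfr + anfr) / 3;
}
```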
### Static Scoring - Outlier Protection - Enhancement - Thingy

NOTE: this isn't a formal algorithm, but an enhancement of existing algorithms to prevent misrepresenting outliers.

NOTE: this scores based on factors that don't change when new corpus is added, and is used during training.

See: https://github.com/unfoldingWord-dev/tact/blob/master/tact/src/alignment.js#L214

NOTE: This could be run after each of the affected algorithms, or once when we are scoring. If we ran this in the affected algorithms we could just do the averaging there. We should add a note in the other algorithms to reference this.
Algorithms that run just on the unaligned sentence pair may not be representative of the corpus. Therefore these algorithms will suggest outliers with the same confidence as more common alignments.

We want to give preference to alignments that follow the norm found within the corpus instead of relying on the single unaligned sentence pair.

Affected algorithms include:

- N-gram length
- Character length
- Alignment occurrences
- Alignment position

Retrieve the unaligned sentence pairs from the corpus that contain the alignment and run each of them through the affected algorithms. Then calculate an average of each affected algorithm's score.
## Scoring

Calculate the average of all algorithm weights, with a few exceptions.

Some algorithms should influence the final score more than others. For example, the phrase plausibility score multiplies the final score.

Weighted scores can be calculated based on user defined weights to support custom tailored results.
// where "weights" is user provided input// and "scores" contains all the algorithm scores. let weightSum = let confidence = weightsfrequency * scoresfrequency + weightsuniqueness * scoresuniqueness + weightsngramAffinity * scoresngramAffinity + weightsalignmentOccurrences * scoresalignmentOccurrences + weightsalignmentPosition * scoresalignmentPosition + weightscharacterLength * scorescharacterLengthScore / weightSum; confidence = scoresphrasePlausibility * confidence; if isSavedAlignment confidence ++;
## Alignment Prediction
Current implementation:
We pick the alignment with the highest confidence score.
Then we take the tokens in the primary n-gram in that alignment and use them to filter out the remaining alignments that include those primary tokens.
Then we take the tokens in the secondary n-gram in that alignment and use them to penalize the remaining alignments that include those secondary tokens.
We will repeat this until all primary words are covered.
NOTE: we have full coverage of the primary words but not of the secondary words. This is compensated for by learning from the saved alignments. In the future, improvements to the prediction could boost secondary word coverage. This would be particularly helpful for the readability of machine translation output.
Once we have the best alignments for each primary n-gram we sort the alignments according to the word occurrence in the primary sentence.
NOTE: This produces a single proposed sentence. If we wanted to suggest multiple sentences or evaluate the best overall sentence confidence we could generate permutations of all sentences based on the alignments. For increased performance the above permutations might only include the most confident alignments.
Example of sort input/output:

```
// primary text
the book of the genealogy of Jesus Christ

// filtered alignments
[Christ] [Jesus] [the genealogy of] [the book of]

// output
[the book of] [the genealogy of] [Jesus] [Christ]
```
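
A minimal sketch of the process-of-elimination selection described above. The `Alignment` shape and the penalty factor are assumptions for illustration, not word-map's actual types.

```ts
interface Alignment {
  primaryTokens: string[];   // token occurrences, assumed unique within the sentence
  secondaryTokens: string[];
  confidence: number;
}

function selectAlignments(candidates: Alignment[], primaryTokens: string[]): Alignment[] {
  const selected: Alignment[] = [];
  const covered = new Set<string>();
  let remaining = [...candidates];
  while (covered.size < primaryTokens.length && remaining.length > 0) {
    // pick the alignment with the highest confidence
    remaining.sort((a, b) => b.confidence - a.confidence);
    const best = remaining.shift()!;
    selected.push(best);
    best.primaryTokens.forEach((t) => covered.add(t));
    // eliminate alignments that reuse the chosen primary tokens
    remaining = remaining.filter(
      (a) => !a.primaryTokens.some((t) => best.primaryTokens.includes(t))
    );
    // penalize alignments that reuse the chosen secondary tokens
    for (const a of remaining) {
      if (a.secondaryTokens.some((t) => best.secondaryTokens.includes(t))) {
        a.confidence *= 0.5; // penalty factor is an assumption
      }
    }
  }
  // finally, order the selections by word occurrence in the primary sentence
  return selected.sort(
    (a, b) =>
      primaryTokens.indexOf(a.primaryTokens[0]) -
      primaryTokens.indexOf(b.primaryTokens[0])
  );
}
```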
## Saved Alignments Minimum Viable Product
Below are the components required to build a saved alignment tool for use in translationCore. The initial goal is not to implement all of the algorithms described in this document but to take the first step in bringing word alignment prediction to translationCore.
Algorithms/steps needed to implement saved alignments:
- Engine Training - for saved alignments
- Frequency Ratios
- Alignment Position
- Scoring
- Alignment Prediction
## Language Configuration Settings API

Word-MAP will be extended to support storing and retrieving language-specific configuration values via an API. This may also include saved alignment data.

### Potential API Implementation in DCS

Using DCS we could define a repository naming convention and repository structure that would allow the optimal storage of Word-MAP configuration variables and alignment data. For example:

Repository: `jag3773/word-map_en-hbo`

The above would specify that the configuration information is for the English (`en`) and Biblical Hebrew (`hbo`) language pair.
Repository Contents:

- config.json
- saved_alignments.json

The `config.json` file would be a small JSON file that could hold the configurable values for the prediction algorithms. A provisional example:
```json
{
  "source_language": {
    "direction": "rtl",
    "identifier": "hbo",
    "title": "Ancient Hebrew"
  },
  "target_language": {
    "direction": "ltr",
    "identifier": "en",
    "title": "English"
  },
  "config": [
    {
      "feature": "char_length",
      "option1": 20,
      "option2": 25
    },
    ...
    {
      "feature": "length_affinity",
      "option1": 3,
      "option2": 5
    }
  ],
  "word-map_version": "1"
}
```
The `saved_alignments.json` file would be a JSON array of saved alignment pairs.
#### Benefits

One benefit of this approach is that it allows users to back up their configuration and reload it on another system using their existing account. A second benefit is that users would be able to share configuration and saved alignments with one another, in the same manner in which they share projects. A third benefit is that it allows Word-MAP to define a cascading list of defaults, possibly looking at the Door43-Catalog organization first, then checking a user's repo for overrides.
#### Limitations

This approach requires a DCS account. This is not a problem for tC, but other software using Word-MAP may not have support for DCS accounts. Such software would still be able to read configuration data, but could not write it unless it adds support for DCS user authentication.