This library provides utilities for injecting alignment data from WordMAP into USFM3 files.
From the command line:
npm i wordmap-usfm -g
wordmap-usfm --help
As a module:
npm i wordmap-usfm
;...const alignedUSFM = ;
Alignment JSON Data Structure
The intent is to have a single file as input that allows full round-trip conversion to USFM 3 without any loss. One of the resources is intended to be the source/primary text; the second resource is the target language of the USFM file. The target language is what would typically be shown in the USFM file without any alignment data.
In theory, the data structure is extensible to allow for other metadata per word or token, and potentially more than two languages, although that may not lend itself well to USFM.
Top level attributes
The top level attributes of the data structure are conformsTo, metadata, and segments.
The conformsTo attribute specifies which version of the spec was used to generate the alignment file. We plan to change the alignment specification over time and are using Semantic Versioning, starting with version 0.1 for this release.
The top level metadata attribute stores information about the content stored in the file.
The metadata.modified attribute is a Unix timestamp of the last modification of the file. Any time the content of the file is edited, this timestamp should be updated so that users of the file can keep up with the latest version of the data.
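A minimal sketch of keeping metadata.modified current on every edit. It assumes the timestamp is stored in whole seconds since the Unix epoch (the text above does not say whether seconds or milliseconds are intended) and that conformsTo is a string. The helper name touchModified is hypothetical and is not part of wordmap-usfm.

```javascript
// Hypothetical helper, not part of wordmap-usfm.
// Returns a copy of the alignment data with metadata.modified refreshed.
// Assumption: the timestamp is in whole seconds since the Unix epoch.
function touchModified(alignmentData, nowMs = Date.now()) {
  return {
    ...alignmentData,
    metadata: {
      ...alignmentData.metadata,
      modified: Math.floor(nowMs / 1000),
    },
  };
}

// Illustrative data only.
const before = {
  conformsTo: "0.1",
  metadata: { modified: 0, resources: {} },
  segments: [],
};
const after = touchModified(before, 1500000000000);
console.log(after.metadata.modified); // 1500000000
```

Returning a copy rather than mutating the input is a design choice of this sketch; updating the object in place would work equally well for a tool that owns the data.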
The metadata.resources attribute is an object whose keys are the names of resources and whose values are the metadata describing those resources. These keys are used later in the segments section of the file to specify which resource the respective content belongs to. The expected attributes for each resource are
version:String. This information will be used in generating the headers of the USFM3 file output.
- One text will be specified as the language of the USFM file, and the other will be aligned to it as USFM3 milestones.
- In each segment, the text of one resource will be used as the raw USFM string for the verse during USFM3 generation.
- The tokens in the corresponding segment of the other resource will be aligned to the tokens found in the raw string of the first.
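The top level of the file, as described so far, might be sketched like this. Only version is named in the text above as a resource attribute; languageCode and name, the key r0, and all values are assumptions made for illustration. The function resourcesMissingVersion is a hypothetical check, not part of wordmap-usfm.

```javascript
// Illustrative data only. "version" is the one resource attribute named in
// the text above; "languageCode" and "name" are assumed for this sketch.
const alignmentData = {
  conformsTo: "0.1",
  metadata: {
    modified: 1500000000,
    resources: {
      r0: { languageCode: "grc", name: "Source text", version: "1" },
      r1: { languageCode: "en", name: "Target text", version: "1" },
    },
  },
  segments: [],
};

// Hypothetical check: returns the keys of resources whose version is not a
// string, since the version feeds the USFM3 headers.
function resourcesMissingVersion(data) {
  return Object.entries(data.metadata.resources)
    .filter(([, resource]) => typeof resource.version !== "string")
    .map(([key]) => key);
}

console.log(resourcesMissingVersion(alignmentData)); // []
```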
The segments attribute is an array of individual segments of the resources, grouped together at the aligned segment level.
The segments[n].resources attribute is an object whose keys correspond to the keys in metadata.resources. In the example below, the resource keys include r1.
The value of each resource key is an object with the following attributes:
The text attribute holds the raw string of the segment.
The tokens attribute is an array of individual tokens as strings.
- Later spec revisions will include tokens represented as data objects.
The metadata attribute is an object that holds data about the segment.
- Currently only requires a contextId attribute that identifies where the segment belongs, such as the verse identifier.
- Having metadata.contextId at each resource allows for alignments to exist between different versification systems.
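A rough sketch of one entry in segments, built from the attributes listed above. The whitespace tokenizer is a stand-in for whatever tokenizer actually produced the file, and the key r0, the contextId value "TIT001001", and the helper name buildSegmentResource are all assumptions made for illustration.

```javascript
// Hypothetical builder for one value of segments[n].resources.
// Assumption: tokens come from splitting on whitespace, which is only a
// placeholder for a real tokenizer.
function buildSegmentResource(text, contextId) {
  return {
    text,
    tokens: text.split(/\s+/).filter((token) => token.length > 0),
    metadata: { contextId },
  };
}

// Illustrative data only; the contextId format is made up for this sketch.
const segment = {
  resources: {
    r0: buildSegmentResource("Παῦλος δοῦλος Θεοῦ", "TIT001001"),
    r1: buildSegmentResource("Paul a servant of God", "TIT001001"),
  },
  alignments: [],
};

console.log(segment.resources.r1.tokens);
// ["Paul", "a", "servant", "of", "God"]
```

Because each resource carries its own metadata.contextId, the two sides of a segment could point at different verse identifiers when their versification systems differ.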
The segments[n].alignments attribute is an array of individual alignments between tokens of the resources at the same level.
Each alignment is an object with the attributes score, verified:Boolean, and a [key]:Object for each key that corresponds with the resources.
The score attribute holds the confidence of this specific alignment as generated by the alignment tool.
The verified attribute holds a boolean indicating whether the alignment was generated or approved by a human.
- The remaining attributes each hold an array of indexes that correspond to their string counterparts in segments[n].resources[key].tokens for the respective key.
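To make the index lookup concrete, here is a minimal sketch that resolves an alignment's indexes back to token strings. The helper name resolveAlignment, the key r0, the score values (assumed here to lie between 0 and 1), and the sample data are all hypothetical.

```javascript
// Hypothetical helper: for each resource key in the segment, maps the
// alignment's indexes to the token strings they point at. "score" and
// "verified" are ignored because they are not resource keys.
function resolveAlignment(alignedSegment, alignment) {
  const resolved = {};
  for (const key of Object.keys(alignedSegment.resources)) {
    const indexes = alignment[key] || []; // a missing side counts as empty
    resolved[key] = indexes.map((i) => alignedSegment.resources[key].tokens[i]);
  }
  return resolved;
}

// Illustrative data only.
const alignedSegment = {
  resources: {
    r0: {
      text: "δοῦλος Θεοῦ",
      tokens: ["δοῦλος", "Θεοῦ"],
      metadata: { contextId: "TIT001001" },
    },
    r1: {
      text: "a servant of God",
      tokens: ["a", "servant", "of", "God"],
      metadata: { contextId: "TIT001001" },
    },
  },
  alignments: [
    { score: 0.9, verified: false, r0: [0], r1: [0, 1] },
    { score: 1, verified: true, r0: [1], r1: [2, 3] },
  ],
};

console.log(resolveAlignment(alignedSegment, alignedSegment.alignments[0]));
// { r0: ["δοῦλος"], r1: ["a", "servant"] }
```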
The example below shows an alignment of the above tokens. Note that alignments to null can be represented as not being present at all. Optionally they can be represented as indexes on one side and an empty array on the other.
The example below is fabricated to show many to many, many to one, one to many, one to one, none to many, one to none, one to many verified, and many to one verified, in that order. The non-verified alignments are machine aligned; the verified ones are human aligned or confirmed.
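The cardinalities listed above can be derived from the index arrays alone. The following sketch labels each alignment, treating a missing key and an empty array as the same "none" case. The function names, the keys r0 and r1 as source and target, and the sample scores and indexes are assumptions made for illustration.

```javascript
// Hypothetical classifier, not part of wordmap-usfm.
// A side that is absent or an empty array is treated as "none".
function describeSide(indexes) {
  if (!indexes || indexes.length === 0) return "none";
  return indexes.length === 1 ? "one" : "many";
}

function classifyAlignment(alignment, sourceKey, targetKey) {
  const label =
    describeSide(alignment[sourceKey]) + " to " + describeSide(alignment[targetKey]);
  return alignment.verified ? label + " verified" : label;
}

// Fabricated alignments, one per case, in the order listed above.
const samples = [
  { score: 0.4, verified: false, r0: [0, 1], r1: [0, 1] },
  { score: 0.5, verified: false, r0: [2, 3], r1: [2] },
  { score: 0.6, verified: false, r0: [4], r1: [3, 4] },
  { score: 0.7, verified: false, r0: [5], r1: [5] },
  { score: 0, verified: false, r1: [6, 7] }, // r0 side absent
  { score: 0, verified: false, r0: [6], r1: [] }, // r1 side empty
  { score: 1, verified: true, r0: [7], r1: [8, 9] },
  { score: 1, verified: true, r0: [8, 9], r1: [10] },
];

const labels = samples.map((a) => classifyAlignment(a, "r0", "r1"));
console.log(labels);
```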
- Support extracting alignment data from USFM3. This will be useful when importing USFM into tC.
- Support alignments that span verses
- Support alignments that span chapters