@gmod/gtf
GTF, the General Transfer Format (also called Gene Transfer Format), is identical to GFF version 2. This module reads and writes GTF data and aims to be a complete implementation of the GTF specification.
- streaming parsing and streaming formatting
- creates transcript features with child_features
- only compatible with GTF
Note: For JBrowse, we generally encourage GFF3 over GTF.
For GFF3, check out the @gmod/gff-js package.
Install
$ npm install --save @gmod/gtf
Usage
import gtf from '@gmod/gtf'
// parse a file from a file name
gtf.parseFile('path/to/my/file.gtf', { parseAll: true })
  .on('data', data => {
    if (data.directive) {
      console.log('got a directive', data)
    }
    else if (data.comment) {
      console.log('got a comment', data)
    }
    else if (data.sequence) {
      console.log('got a sequence from a FASTA section')
    }
    else {
      console.log('got a feature', data)
    }
  })
// parse a stream of GTF text
import fs from 'fs'
fs.createReadStream('path/to/my/file.gtf')
  .pipe(gtf.parseStream())
  .on('data', data => {
    console.log('got item', data)
    return data
  })
  .on('end', () => {
    console.log('done parsing!')
  })
// parse a string of gtf synchronously
const stringOfGTF = fs
  .readFileSync('my_annotations.gtf')
  .toString()
const arrayOfThings = gtf.parseStringSync(stringOfGTF)
// format an array of items to a string
const formattedGTF = gtf.formatSync(arrayOfThings)
// format a stream of things to a stream of text.
// inserts sync marks automatically.
// note: this could create new gtf lines for transcript features
myStreamOfGTFObjects
  .pipe(gtf.formatStream())
  .pipe(fs.createWriteStream('my_new.gtf'))
// format a stream of things and write it to
// a gtf file. inserts sync marks
// note: this could create new gtf lines for transcript features
// formatFile takes the stream and the destination path (see the API section)
gtf.formatFile(myStreamOfGTFObjects, 'path/to/destination.gtf')
Object format
features
Because GTF cannot represent a 3-level hierarchy (gene -> transcript -> exon), we parse GTF by creating transcript features with child features.
We do not create features from the gene_id. Values that are . in the GTF are null in the output.
ctgA bare_predicted CDS 10000 11500 . + 0 transcript_id "Apple1";
Note that this creates an additional transcript feature from the transcript_id when featureType is not 'transcript'. It then creates a child CDS feature from the line of GTF shown above.
[
  [
    {
      "seq_name": "ctgA",
      "source": "bare_predicted",
      "featureType": "transcript",
      "start": 10000,
      "end": 11500,
      "score": null,
      "strand": "+",
      "frame": "0",
      "attributes": { "transcript_id": [ "\"Apple1\"" ] },
      "child_features": [[
        {
          "seq_name": "ctgA",
          "source": "bare_predicted",
          "featureType": "CDS",
          "start": 10000,
          "end": 11500,
          "score": null,
          "strand": "+",
          "frame": "0",
          "attributes": { "transcript_id": [ "\"Apple1\"" ] },
          "child_features": [],
          "derived_features": []
        }
      ]],
      "derived_features": []
    }
  ]
]
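The grouping described above can be sketched in plain JavaScript. This is only an illustration under stated assumptions, not the module's actual implementation: `groupByTranscript` is a hypothetical helper, it expects features already parsed into objects of the shape shown, it groups by the first transcript_id value only, and widening the transcript span to cover all of its children is an assumption of this sketch.

```javascript
// Hypothetical helper (not part of @gmod/gtf): wrap parsed GTF features in
// synthetic transcript features keyed by their first transcript_id value.
function groupByTranscript(features) {
  const transcripts = new Map()
  for (const feature of features) {
    const id = (feature.attributes.transcript_id || [])[0]
    if (id === undefined) continue // this sketch skips lines without a transcript_id
    if (!transcripts.has(id)) {
      // the transcript starts out as a copy of the first line seen for this id
      transcripts.set(id, {
        ...feature,
        featureType: 'transcript',
        attributes: { transcript_id: [id] },
        child_features: [],
        derived_features: [],
      })
    }
    if (feature.featureType === 'transcript') continue // not its own child
    const transcript = transcripts.get(id)
    // assumption: the transcript spans all of its children
    transcript.start = Math.min(transcript.start, feature.start)
    transcript.end = Math.max(transcript.end, feature.end)
    transcript.child_features.push([
      { ...feature, child_features: [], derived_features: [] },
    ])
  }
  // each top-level item is wrapped in an array, as in the example output
  return [...transcripts.values()].map(t => [t])
}

// the CDS line from the example above, already split into fields
const cds = {
  seq_name: 'ctgA',
  source: 'bare_predicted',
  featureType: 'CDS',
  start: 10000,
  end: 11500,
  score: null,
  strand: '+',
  frame: '0',
  attributes: { transcript_id: ['"Apple1"'] },
}
const grouped = groupByTranscript([cds])
console.log(JSON.stringify(grouped, null, 2))
```

Keeping the nested-array shape means the sketch's output can be compared directly against the JSON shown above.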
directives, comments, sequences
parseDirective("##gtf\n")
// returns
{
  "directive": "gtf"
}
parseComment('# hi this is a comment\n')
// returns
{
  "comment": "hi this is a comment"
}
// Sequence items come from any embedded `##FASTA` section in the GTF file.
{
  "id": "ctgA",
  "description": "test contig",
  "sequence": "ACTGACTAGCTAGCATCAGCGTCGTAGCTATTATATTACGGTAGCCA"
}
API
parseStream
Parse a stream of text data into a stream of feature, directive, and comment objects.
Parameters
- options Object optional options object (optional, default {})
  - options.encoding string text encoding of the input GTF. default 'utf8'
  - options.parseAll boolean default false. if true, will parse all items. overrides other flags
  - options.parseFeatures boolean default true
  - options.parseDirectives boolean default false
  - options.parseComments boolean default false
  - options.parseSequences boolean default true
  - options.bufferSize Number maximum number of GTF lines to buffer. defaults to 1000
Returns ReadableStream stream (in objectMode) of parsed items
parseFile
Read and parse a GTF file from the filesystem.
Parameters
- filename string the filename of the file to parse
- options Object optional options object
  - options.encoding string the file's string encoding, defaults to 'utf8'
  - options.parseAll boolean default false. if true, will parse all items. overrides other flags
  - options.parseFeatures boolean default true
  - options.parseDirectives boolean default false
  - options.parseComments boolean default false
  - options.parseSequences boolean default true
  - options.bufferSize Number maximum number of GTF lines to buffer. defaults to 1000
Returns ReadableStream stream (in objectMode) of parsed items
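parseStream and parseFile list the same flags, so the way the defaults and the parseAll override combine can be pictured as one small merging step. The sketch below is hypothetical: `DEFAULT_PARSE_OPTIONS` and `resolveParseOptions` are illustrative names, not part of the module's API, and the code only restates the defaults documented above.

```javascript
// Defaults as documented for parseStream / parseFile.
const DEFAULT_PARSE_OPTIONS = {
  encoding: 'utf8',
  parseAll: false,
  parseFeatures: true,
  parseDirectives: false,
  parseComments: false,
  parseSequences: true,
  bufferSize: 1000,
}

// Hypothetical helper: lay user options over the defaults, then let
// parseAll switch every parse* flag on, overriding the other flags.
function resolveParseOptions(options = {}) {
  const resolved = { ...DEFAULT_PARSE_OPTIONS, ...options }
  if (resolved.parseAll) {
    resolved.parseFeatures = true
    resolved.parseDirectives = true
    resolved.parseComments = true
    resolved.parseSequences = true
  }
  return resolved
}

// with no options only features and sequences are parsed
const defaultsOnly = resolveParseOptions()
// parseAll wins even when another flag is explicitly turned off
const everything = resolveParseOptions({ parseAll: true, parseComments: false })
console.log(defaultsOnly.parseComments, everything.parseComments)
```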
parseStringSync
Synchronously parse a string containing GTF and return an array of the parsed items.
Parameters
Returns Array array of parsed features, directives, and/or comments
formatSync
Format an array of GTF items (features, directives, comments) into a string of GTF. Does not insert synchronization (###) marks. Does not insert a directive if one is not already there.
Parameters
- items
Returns String the formatted GTF
formatStream
Format a stream of items (of the type produced by this script) into a stream of GTF text.
Inserts synchronization (###) marks automatically.
Parameters
- options Object
formatFile
Format a stream of items (of the type produced by this script) into a GTF file and write it to the filesystem.
Inserts synchronization (###) marks and a ##gtf directive automatically (if one is not already present).
Parameters
- stream ReadableStream the stream to write to the file
- filename String the file path to write to
- options Object (optional, default {})
Returns Promise promise for the written filename
util
Table of Contents
- util
- unescape
- _escape
- escapeColumn
- parseAttributes
- parseFeature
- parseDirective
- formatAttributes
- formatFeature
- formatDirective
- formatComment
- formatSequence
- formatItem
util
unescape
Unescape a string/text value used in a GTF attribute. Textual attributes should be surrounded by double quotes.
Source info: https://mblab.wustl.edu/GTF22.html, https://en.wikipedia.org/wiki/Gene_transfer_format
Parameters
- s String
Returns String
_escape
Escape a value for use in a GTF attribute value.
Parameters
- regex
- s String
Returns String
escapeColumn
Escape a value for use in a GTF column value.
Parameters
- s String
Returns String
parseAttributes
Parse the 9th column (attributes) of a GTF feature line.
Parameters
- attrString String
Returns Object
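As a rough picture of what parsing the 9th column involves, the sketch below splits an attributes string into an object of arrays. It is not the module's implementation: `parseAttributesSketch` is a hypothetical name, no unescaping is done, and the double quotes are kept on the values only because the Object format example shows them that way.

```javascript
// Hypothetical sketch: split a GTF attributes column into { key: [values] }.
// Quotes stay on the values, as in the Object format example.
function parseAttributesSketch(attrString) {
  const attrs = {}
  if (!attrString || attrString === '.') return attrs
  // split on semicolons that are not inside double quotes
  const pairs = attrString.match(/(?:[^;"]|"[^"]*")+/g) || []
  for (const pair of pairs) {
    // first token is the key, everything after the whitespace is the value
    const match = pair.trim().match(/^(\S+)\s+(.*)$/)
    if (!match) continue // skip empty pieces and keys without a value
    const [, key, value] = match
    if (!attrs[key]) attrs[key] = []
    attrs[key].push(value.trim())
  }
  return attrs
}

const parsedAttrs = parseAttributesSketch('gene_id "G1"; transcript_id "Apple1";')
console.log(parsedAttrs)
```

Storing every value in an array lets a key that appears more than once on a line keep all of its values.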
parseFeature
Parse a GTF feature line.
Parameters
- line String
Returns the parsed line in an object
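A minimal sketch of splitting one feature line into the field names used in the Object format example follows. It is an illustration, not the module's code: `parseFeatureLineSketch` and `GTF_COLUMNS` are hypothetical names, the attributes column is left as a raw string, and no unescaping is attempted.

```javascript
// Field names follow the Object format example.
const GTF_COLUMNS = [
  'seq_name', 'source', 'featureType', 'start', 'end',
  'score', 'strand', 'frame', 'attributes',
]

// Hypothetical sketch: split a tab-separated GTF line into named fields.
function parseFeatureLineSketch(line) {
  const fields = line.replace(/\r?\n$/, '').split('\t')
  const feature = {}
  GTF_COLUMNS.forEach((name, i) => {
    const raw = fields[i]
    // "." (or a missing column) becomes null in the output
    feature[name] = raw === undefined || raw === '.' ? null : raw
  })
  // coordinates and score are numeric; frame stays a string as in the example
  if (feature.start !== null) feature.start = parseInt(feature.start, 10)
  if (feature.end !== null) feature.end = parseInt(feature.end, 10)
  if (feature.score !== null) feature.score = parseFloat(feature.score)
  return feature
}

const parsedLine = parseFeatureLineSketch(
  'ctgA\tbare_predicted\tCDS\t10000\t11500\t.\t+\t0\ttranscript_id "Apple1";\n',
)
console.log(parsedLine)
```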
parseDirective
Parse a GTF directive/comment line.
Parameters
- line String
Returns Object the information in the directive
formatAttributes
Format an attributes object into a string suitable for the 9th column of GTF.
Parameters
- attrs Object
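The reverse direction can be sketched the same way. The helper below is hypothetical (`formatAttributesSketch` is not the module's function): it assumes attribute values are arrays that already carry their double quotes, as in the Object format example, and it assumes an empty attributes object is written as a single dot.

```javascript
// Hypothetical sketch: turn { key: [values] } back into a GTF 9th column.
// Values are written as-is, so they are expected to carry their own quotes.
function formatAttributesSketch(attrs) {
  const parts = []
  for (const [key, values] of Object.entries(attrs || {})) {
    for (const value of values) {
      parts.push(`${key} ${value};`)
    }
  }
  // assumption: no attributes at all is written as "."
  return parts.length > 0 ? parts.join(' ') : '.'
}

const ninthColumn = formatAttributesSketch({
  gene_id: ['"G1"'],
  transcript_id: ['"Apple1"'],
})
console.log(ninthColumn)
```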
formatFeature
Format a feature object or array of feature objects into one or more lines of GTF.
Parameters
- featureOrFeatures
formatDirective
Format a directive into a line of GTF.
Parameters
- directive Object
Returns String
formatComment
Format a comment into a GTF comment. Yes I know this is just adding a # and a newline.
Parameters
- comment Object
Returns String
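For directive and comment objects of the shape shown under "directives, comments, sequences", formatting is little more than adding a prefix and a newline. The two functions below are hypothetical sketches, not the module's code; writing a space after the `#` of a comment is an assumption, chosen so that the parseComment example round-trips.

```javascript
// Hypothetical sketch: { directive: 'gtf' } becomes the line "##gtf".
function formatDirectiveSketch(directive) {
  return `##${directive.directive}\n`
}

// Hypothetical sketch: { comment: 'text' } becomes the line "# text".
// The space after "#" is an assumption of this sketch.
function formatCommentSketch(comment) {
  return `# ${comment.comment}\n`
}

const directiveLine = formatDirectiveSketch({ directive: 'gtf' })
const commentLine = formatCommentSketch({ comment: 'hi this is a comment' })
console.log(directiveLine + commentLine)
```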
formatSequence
Format a sequence object as FASTA
Parameters
- seq Object
Returns String formatted single FASTA sequence
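A sequence object of the shape shown earlier (id, description, sequence) maps onto a FASTA record in a straightforward way. The sketch below is hypothetical (`formatSequenceSketch` is not the module's function), and wrapping the sequence at 60 characters per line is an assumption, not documented behavior.

```javascript
// Hypothetical sketch: format { id, description, sequence } as one FASTA record.
// The 60-character line width is an assumption of this sketch.
function formatSequenceSketch(seq, lineWidth = 60) {
  const header = seq.description
    ? `>${seq.id} ${seq.description}`
    : `>${seq.id}`
  const lines = []
  for (let i = 0; i < seq.sequence.length; i += lineWidth) {
    lines.push(seq.sequence.slice(i, i + lineWidth))
  }
  return [header, ...lines].join('\n') + '\n'
}

const fastaRecord = formatSequenceSketch({
  id: 'ctgA',
  description: 'test contig',
  sequence: 'ACTGACTAGCTAGCATCAGCGTCGTAGCTATTATATTACGGTAGCCA',
})
console.log(fastaRecord)
```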
formatItem
Format a directive, comment, or feature, or array of such items, into one or more lines of GTF.
Parameters
Notes and resources
- This is an adaptation of the JBrowse GTF parser
- GTF docs
License
MIT © Robert Buels