🍌
graphy.js
A faster-than-lightning, asynchronous, streaming RDF deserializer. It implements the RDFJS Representation Interfaces and natively parses Turtle, TriG, N-Triples, and N-Quads.
A future release is aiming to provide a query-like JavaScript API for traversing RDF graphs. It is currently under development. JSON-LD support has also been suspended until the expand algorithm is re-implemented.
### Parse serialized RDF graphs faster than lightning!
This library boasts a set of high performance parsers, each one specialized for its corresponding serialization format. Consider the following benchmark test:
DBpedia's 2015-04 English persondata.nt in N-Triples format:
Count how many triples are in the file using graphy:
console.time('g');
const fs = require('fs');
const graphy = require('graphy');
let c_triples = 0;
let input = fs.createReadStream('persondata_en.nt');
graphy.nt.parse(input, {
data(triple) {
c_triples += 1;
},
end() {
console.timeEnd('g');
console.log(`${c_triples} triples parsed`);
},
});
versus N3.js v0.4.5:
console.time('n');
const fs = require('fs');
const n3 = require('n3');
let c_triples = 0;
let input = fs.createReadStream('persondata_en.nt');
new n3.Parser(/* faster w/o format*/).parse(input, function(err, triple) {
if(triple) {
c_triples += 1;
}
else {
console.timeEnd('n');
console.log(`${c_triples} triples parsed`);
}
});
Benchmark Results:
Each benchmark listed was the best of 10 trials:
DBpedia file | # quads | N3.js time | N3.js velocity | graphy time | graphy velocity | speedup |
---|---|---|---|---|---|---|
persondata_en.nt | 8,397,081 | 13859 ms | 605,871 op/s | 4792 ms | 1,752,276 op/s | 2.892x |
instance-types_en.nq | 5,647,972 | 12440 ms | 453,997 op/s | 6478 ms | 871,755 op/s | 1.920x |
redirects_en.ttl | 6,831,505 | 12098 ms | 564,670 op/s | 7000 ms | 975,810 op/s | 1.728x |
persondata_en.ttl | 8,397,081 | 15740 ms | 533,463 op/s | 9287 ms | 904,084 op/s | 1.694x |
article-categories_en.nq | 20,232,709 | 46386 ms | 436,172 op/s | 27561 ms | 734,098 op/s | 1.683x |
What's the catch? See Performance details
Piping streams to transform
Streams can also be piped into graphy to use it as a Transform.
const fs = require('fs');
const request = require('request');
const graphy = require('graphy');
request(path_to_remote_data)
	.pipe(graphy.ttl.parse({
		data(triple, output) {
			let string_chunk = do_something(triple);
			output.push(string_chunk);
		},
	}))
	.pipe(fs.createWriteStream(path_to_output_file));
Setup
Install it as a dependency for your project:
$ npm install --save graphy
API
Introduction
The module does not require its dependencies until they are explicitly accessed by the user (i.e., they are lazily loaded), so only what is requested will be loaded (the same goes for browsers, so long as you are using browserify to bundle your project).
However, no matter which component of graphy you are loading, the DataFactory methods will always be available. These allow you to create new instances of RDF terms for comparing, injecting, serializing, or using alongside their parser-derived siblings.
Pseudo-Datatypes:
Throughout this API document, the following datatypes are used to represent expectations imposed on primitive-datatyped parameters to functions, exotic uses of primitives in class methods (in future versions), and so forth:
- `hash` - refers to a simple `object` with keys and values (e.g. `{key: 'value'}`)
- `key` - refers to a `string` used for accessing an arbitrary value in a `hash`
- `list` - refers to a one-dimensional `Array` containing only elements of the same type/class
const graphy = require('graphy');
Methods:
- `graphy.namedNode(iri: string)`
  - returns a new NamedNode
  - example:
    graphy.namedNode('ex://test')+''; // '<ex://test>'
- `graphy.literal(contents: string[, datatype_or_lang: string|langtag])`
  - returns a new Literal with optional `datatype_or_lang`
  - example:
    graphy.literal('"')+''; // '"\""^^<http://www.w3.org/2001/XMLSchema#string>'
    graphy.literal('42', 'ex://datatype')+''; // '"42"^^<ex://datatype>'
    graphy.literal('hello Mars!', '@en')+''; // '"hello Mars!"@en'
- `graphy.blankNode`: `function`
  - `{function}()` -- the no-args constructor generates a new UUID4 in order to thwart label collisions
  - `{function}(label: string)` -- uses the given `label`
  - `{function}(label_manager: Parser|Graph|other)` -- calls `label_manager.next_label()` to generate a new label. It is always better to use this variant than the no-args version because it guarantees collision-free labels and is also more efficient.
  - returns a new BlankNode
  - example:
    graphy.blankNode()+''; // '_:439e14ae_1531_4683_ac96_b9f091da9595'
    graphy.blankNode('label')+''; // '_:label'
    graphy.nt.parse('<a> <b> <c> .', {
      data() { graphy.blankNode(this)+''; }, // '_:g0'
    });
- `graphy.defaultGraph()`
  - returns a new DefaultGraph
  - example:
    graphy.defaultGraph()+''; // ''
    graphy.defaultGraph().termType; // 'DefaultGraph'
- `graphy.triple(subject: Term, predicate: Term, object: Term)`
  - returns a new Triple
- `graphy.quad(subject: Term, predicate: Term, object: Term, graph: Term)`
  - returns a new Quad
Parsing
This section documents graphy's high performance parser, which can be used directly for parsing a readable stream, transforming a readable stream to a writable stream, and parsing static strings. Each parser also allows pausing and resuming the stream.
- Parse events
- Stream Control
- Parse options
- Parsers
Parse Events
The parsers are engineered to run as fast as computerly possible. For this reason, they do not extend EventEmitter, which normally allows event handlers to bind via `.on()` calls. Instead, any event handlers must be specified during a call to the parser. The name of an event is given by the key of a `hash` that gets passed as the `config`, where the value of each entry is the event's callback function.
For example:
const parse_trig = graphy.trig.parse;
parse_trig(input, {
data(quad) { // 'data' event handler
// ..
},
error(parse_error) { // 'error' event handler
// ..
},
});
The `prefixes` argument is a hash of the final mappings at the time the end of the input was reached. It is only available for `graphy.ttl.parse` and `graphy.trig.parse`.
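To illustrate the design choice described in this section, here is a minimal sketch of dispatching events through a plain config hash rather than an EventEmitter. The function `parse_lines` and its toy "one statement per line" format are hypothetical stand-ins and are not part of graphy; the sketch also calls handlers as plain functions, so `this` is not bound the way graphy binds it.

```js
// hypothetical stand-in for a parser: NOT graphy's implementation
function parse_lines(s_input, h_config) {
	// look up each handler once, up front; missing handlers become no-ops
	const noop = () => {};
	const f_data = h_config.data || noop;
	const f_error = h_config.error || noop;
	const f_end = h_config.end || noop;

	for(const s_line of s_input.split('\n')) {
		const s_trimmed = s_line.trim();
		if(!s_trimmed) continue;

		// in this toy format, a statement must end with a dot
		if(s_trimmed.endsWith('.')) f_data(s_trimmed);
		else f_error(new Error(`unterminated statement: ${s_trimmed}`));
	}

	f_end();
}

const a_seen = [];
parse_lines('<a> <b> <c> .\n<d> <e>\n<f> <g> <h> .', {
	data(s_statement) { a_seen.push(`data: ${s_statement}`); },
	error(e_parse) { a_seen.push(`error: ${e_parse.message}`); },
	end() { a_seen.push('end'); },
});
console.log(a_seen);
```

Because the handlers are resolved once before the loop, each event costs a direct function call instead of a listener lookup.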
Stream Control
For any of the event callbacks listed above, you can control the stream's state and temporarily suspend events from being emitted by making calls through `this`.
For example:
parse_ttl(input, {
data(triple) {
if(triple.object.isNamedNode) {
this.pause(); // no events will be emitted ...
asyncFunction(() => {
this.resume(); // ... until now
});
}
},
});
- `this.pause()`
  - Immediately suspends any `data`, `prefix` or `base` events from being emitted until `this.resume()` is called. Also pauses the readable input stream until more data is needed.
  - Some of the parsers will finish parsing the current chunk of stream data before the call to `this.pause()` returns, others will finish parsing the current production. When this happens, the parser queues any would-be events to a buffer which will be released to the corresponding event callbacks once `this.resume()` is called.
- `this.resume()`
  - Resumes firing event callbacks. Once there are no more queued events, the parser will automatically resume the readable input stream.
- `this.stop()`
  - Immediately stops parsing and permanently unbinds all event listeners so that no more events will be emitted, not even `end`. Will also attempt to close the input stream if it can (calls `.destroy` on `fs.ReadStream` objects). Useful for exiting read streams of large files prematurely.
  - example:
    parse_ttl(fs.createReadStream('input.ttl'), {
      data(triple) {
        // once we've found what we're looking for...
        if(triple.object.isLiteral && /find me/.test(triple.object.value)) {
          this.stop();
        }
      },
    });
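The queueing behaviour described for `this.pause()` and `this.resume()` can be sketched roughly as follows. `EventGate` is a hypothetical helper written for this illustration only; it is not graphy's implementation and it ignores the input stream entirely.

```js
// hypothetical helper: NOT graphy's implementation
class EventGate {
	constructor(f_handler) {
		this.handler = f_handler;
		this.paused = false;
		this.queue = [];
	}

	// called by the "parser" for every would-be event
	emit(w_event) {
		if(this.paused) this.queue.push(w_event);
		else this.handler.call(this, w_event);
	}

	pause() {
		this.paused = true;
	}

	resume() {
		this.paused = false;
		// release queued events in order; stop if a handler pauses again
		while(this.queue.length && !this.paused) {
			this.handler.call(this, this.queue.shift());
		}
	}
}

const a_delivered = [];
const k_gate = new EventGate(function(s_event) {
	a_delivered.push(s_event);
	// pause after the first event, as a data() callback might
	if('t1' === s_event) this.pause();
});

// the "parser" finishes its current chunk even though t1 paused the gate
['t1', 't2', 't3'].forEach(s_event => k_gate.emit(s_event));
const n_before_resume = a_delivered.length;

k_gate.resume();
console.log(n_before_resume, a_delivered);
```

Only `t1` is delivered before the pause takes effect; `t2` and `t3` wait in the queue and are released in order by `resume()`.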
Parse Options
In addition to specifying events, the parser function's `config` parameter also accepts a set of options:
#### graphy.ttl.parse
The Turtle parser:
```js
const parse_ttl = graphy.ttl.parse;
```
The parse function (in the example above, `parse_ttl`) has three variants:
`parse_ttl(input: string, config: hash)`
Synchronously parses the given `input` string. It supports the event handlers: `data`, `prefix`, `base`, `error` and `end`. If a call to `this.pause()` is made during event callbacks, the operation becomes asynchronous.
Example:
parse_ttl('@prefix ex: <http://ex.org/> . ex:s ex:p ex:o .', {
data(triple) {
console.log(triple+''); // '<http://ex.org/s> <http://ex.org/p> <http://ex.org/o> .'
},
});
`parse_ttl(input: Stream, config: hash)`
Asynchronously parses the given `input` stream. It supports the event handlers: `data`, `prefix`, `base`, `error` and `end`.
Example:
// download images from DBpedia
let foaf_depiction = 'http://xmlns.com/foaf/0.1/depiction';
parse_ttl(fs.createReadStream('input.ttl'), {
data(triple) { // for each triple...
if(triple.predicate.value === foaf_depiction) {
download_queue.push(triple.object.value); // download the image
if(download_queue.is_full) { // if there are too many requests...
this.pause();
download_queue.once('available', () => {
this.resume();
});
}
}
},
});
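`parse_ttl(config: hash)`
Creates a `Transform` for simultaneously reading input data and writing output data. It supports the event handlers: `prefix`, `base`, `error`, `end` and an extended version of the `data` event handler that allows the callback to write output data by pushing strings to the callback function's `output` argument.
Example: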
// convert a .ttl Turtle file into a .nt N-Triples file
fs.createReadStream('input.ttl', 'utf8')
.pipe(parse_ttl({
data(triple, output) { // for each triple...
// cast it to a string to produce N-Triples canonicalized form
output.push(triple+'');
},
}))
.pipe(fs.createWriteStream('output.nt'));
#### graphy.trig.parse
The TriG parser:
```js
const parse_trig = graphy.trig.parse;
```
The parse function (in the example above, `parse_trig`) has three variants:
`parse_trig(input: string, config: hash)`
Synchronously parses the given `input` string. It supports the event handlers: `data`, `graph`, `prefix`, `base`, `error` and `end`. If a call to `this.pause()` is made during event callbacks, the operation becomes asynchronous.
Example:
parse_trig('@prefix ex: <http://ex.org/> . ex:g { ex:s ex:p ex:o . }', {
data(quad) {
console.log(quad+''); // '<http://ex.org/s> <http://ex.org/p> <http://ex.org/o> <http://ex.org/g> .'
},
});
`parse_trig(input: stream, config: hash)`
Asynchronously parses the given `input` stream object. It supports the event handlers: `data`, `graph`, `prefix`, `base`, `error` and `end`.
Example:
// only inspect triples within a certain graph
let inspect = false;
parse_trig(input, {
graph_open(graph) {
if(graph.value === 'http://target-graph') inspect = true;
},
graph_close(graph) {
if(inspect) inspect = false;
},
data(quad) {
if(inspect) { // much faster than comparing quad.graph to a string
// do something with triples
}
},
});
`parse_trig(config: hash)`
Creates a `Transform` for simultaneously reading input data and writing output data. It supports the event handlers: `graph`, `prefix`, `base`, `error`, `end` and an extended version of the `data` event handler that allows the callback to write output data by pushing strings to the callback function's `output` argument. For each chunk that is read from the input, the parser will join all the strings in this `output` array (with the empty string as the separator) and then write that to the output.
Example:
// convert a .trig TriG file into a .nq N-Quads file
fs.createReadStream('input.trig', 'utf8')
.pipe(parse_trig({
data(quad, output) { // for each quad...
// cast it to a string to produce its canonicalized form
output.push(quad+'');
},
}))
.pipe(fs.createWriteStream('output.nq'));
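The per-chunk `output` array described above can be approximated with Node's built-in `stream.Transform`. This is only a sketch under stated assumptions: `line_transform` is a hypothetical helper that treats each line as one item, it is not graphy's implementation, and it does not handle multi-byte characters split across chunks.

```js
const {Transform} = require('stream');

// hypothetical helper: NOT graphy's implementation
function line_transform(f_data) {
	let s_leftover = '';

	return new Transform({
		transform(z_chunk, s_encoding, fk_done) {
			const a_lines = (s_leftover + z_chunk.toString()).split('\n');
			// the last piece may be an incomplete line; keep it for the next chunk
			s_leftover = a_lines.pop();

			// handlers push strings here; one write per input chunk
			const a_output = [];
			for(const s_line of a_lines) f_data(s_line, a_output);

			const s_joined = a_output.join('');
			if(s_joined) this.push(s_joined);
			fk_done();
		},

		flush(fk_done) {
			// emit whatever is left once the input has ended
			const a_output = [];
			if(s_leftover) f_data(s_leftover, a_output);
			const s_joined = a_output.join('');
			if(s_joined) this.push(s_joined);
			fk_done();
		},
	});
}

const ds_upper = line_transform((s_line, a_output) => {
	a_output.push(s_line.toUpperCase()+'\n');
});

let s_result = '';
ds_upper.on('data', (z_chunk) => { s_result += z_chunk; });
ds_upper.on('end', () => console.log(JSON.stringify(s_result)));

// the first statement is deliberately split across two chunks
ds_upper.write('<a> <b>');
ds_upper.write(' <c> .\n<d> <e> <f> .');
ds_upper.end();
```

Collecting strings in an array and joining once per chunk keeps the number of writes to the output proportional to the number of input chunks rather than the number of statements.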
#### graphy.nt.parse
The N-Triples parser:
```js
const parse_nt = graphy.nt.parse;
```
The parse function (in the example above, `parse_nt`) has three variants:
`parse_nt(input: string, config: hash)`
Synchronously parses the given `input` string. It supports the event handlers: `triple`, `error` and `end`.
`parse_nt(input: stream, config: hash)`
Asynchronously parses the given `input` stream object. It supports the event handlers: `triple`, `error` and `end`.
`parse_nt(config: hash)`
Creates a `Transform` for simultaneously reading input data and writing output data. It supports the event handlers: `error`, `end` and an extended version of the `triple` event handler that allows the callback to write output data by pushing strings to the callback function's `output` argument. For each chunk that is read from the input, the parser will join all the strings in this `output` array (with the empty string as the separator) and then write that to the output.
#### graphy.nq.parse
The N-Quads parser:
```js
const parse_nq = graphy.nq.parse;
```
The parse function (in the example above, `parse_nq`) has three variants:
`parse_nq(input: string, config: hash)`
Synchronously parses the given `input` string. It supports the event handlers: `quad`, `error` and `end`.
`parse_nq(input: stream, config: hash)`
Asynchronously parses the given `input` stream object. It supports the event handlers: `quad`, `error` and `end`.
`parse_nq(config: hash)`
Creates a `Transform` for simultaneously reading input data and writing output data. It supports the event handlers: `error`, `end` and an extended version of the `quad` event handler that allows the callback to write output data by pushing strings to the callback function's `output` argument. For each chunk that is read from the input, the parser will join all the strings in this `output` array (with the empty string as the separator) and then write that to the output.
RDF Data
The following section documents how graphy represents RDF data in its various forms.
### abstract **Term** implements @RDFJS Term
An abstract class that represents an RDF term by implementing the [RDFJS Term interface](https://github.com/rdfjs/representation-task-force/blob/master/interface-spec.md#term). If you are looking to create an instance of Term, see the [graphy DataFactory](#graphy-factory).
Properties: (implementing RDFJS Term interface)
- `.termType`: `string` -- either `'NamedNode'`, `'BlankNode'`, `'Literal'` or `'DefaultGraph'`
- `.value`: `string` -- depends on the type of term; could be the content of a Literal, the label of a BlankNode, or the IRI of a NamedNode
Methods: (implementing RDFJS Term interface)
- `.equals(other: Term)` -- tests if this term is equal to `other`
  - returns a `boolean`
- `.toCanonical()` -- produces an N-Triples canonical form of the term
  - returns a `string`
Methods:
- `.valueOf()` -- gets called when cast to a `string`. It simply returns `.toCanonical()`
  - returns a `string`
  - example:
    let hey = graphy.namedNode('hello');
    let you = graphy.literal('world!', '@en');
    console.log(hey+' '+you); // '<hello> "world!"@en'
### **NamedNode** extends Term implements @RDFJS NamedNode
A class that represents an RDF named node by implementing the [RDFJS NamedNode interface](https://github.com/rdfjs/representation-task-force/blob/master/interface-spec.md#namednode-extends-term)
Properties: (inherited from Term & implementing RDFJS NamedNode)
- `.termType`: `string` = `'NamedNode'`
- `.value`: `string` -- the IRI of this named node
Properties:
- `.isNamedNode`: `boolean` = `true` -- the preferred and fastest way to test for NamedNode term types
Methods:
### **BlankNode** extends Term implements @RDFJS BlankNode
A class that represents an RDF blank node by implementing the [RDFJS BlankNode interface](https://github.com/rdfjs/representation-task-force/blob/master/interface-spec.md#blanknode-extends-term)
Properties: (inherited from Term & implementing RDFJS BlankNode)
- `.termType`: `string` = `'BlankNode'`
- `.value`: `string` -- the label of this blank node (i.e., without leading `'_:'`)
Properties:
- `.isBlankNode`: `boolean` = `true` -- the preferred and fastest way to test for BlankNode term types
Methods:
### **Literal** extends Term implements @RDFJS Literal
A class that represents an RDF literal by implementing the [RDFJS Literal interface](https://github.com/rdfjs/representation-task-force/blob/master/interface-spec.md#literal-extends-term)
Properties: (inherited from Term & implementing RDFJS Literal interface)
- `.termType`: `string` = `'Literal'`
- `.value`: `string` -- the content of this literal
Properties: (implementing RDFJS Literal interface)
- `.datatype`: `string` -- the datatype IRI of this literal
- `.language`: `string` -- the language tag associated with this literal (empty string if it has no language)
Properties:
- `.isLiteral`: `boolean` = `true` -- the preferred and fastest way to test for Literal term types
Notice: Some serialization formats allow for "simple literals", which do not have an explicit datatype specified. These literals have an implicit datatype of `xsd:string`. However, you can test if an instance of Literal was created with an explicit datatype by using `Object.hasOwnProperty` to discover if `datatype` is defined on the instance object itself or in its prototype chain:
let simple = graphy.literal('no datatype');
simple.datatype; // 'http://www.w3.org/2001/XMLSchema#string'
simple.hasOwnProperty('datatype'); // false
let typed = graphy.literal('yes datatype', 'ex://datatype');
typed.datatype; // 'ex://datatype'
typed.hasOwnProperty('datatype'); // true
let langed = graphy.literal('language tag', '@en');
langed.datatype; // 'http://www.w3.org/1999/02/22-rdf-syntax-ns#langString'
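One way such behaviour can arise is by keeping the implicit datatype on the prototype and only creating an own property when a datatype is given. The class `SketchLiteral` below is a hypothetical stand-in used to show the mechanism; it is an assumption about the general technique, not a copy of graphy's Literal class.

```js
// hypothetical stand-in for a Literal class: NOT graphy's implementation
class SketchLiteral {
	constructor(s_value, p_datatype) {
		this.value = s_value;
		// only create an own property when a datatype was given explicitly
		if(p_datatype) this.datatype = p_datatype;
	}
}

// the implicit datatype is shared by all instances via the prototype
SketchLiteral.prototype.datatype = 'http://www.w3.org/2001/XMLSchema#string';

const k_simple = new SketchLiteral('no datatype');
const k_typed = new SketchLiteral('yes datatype', 'ex://datatype');

console.log(k_simple.datatype, k_simple.hasOwnProperty('datatype'));
console.log(k_typed.datatype, k_typed.hasOwnProperty('datatype'));
```

Reading `.datatype` works the same way in both cases; only `hasOwnProperty` reveals whether the value came from the instance or from the prototype.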
### **IntegerLiteral** extends Literal
A class that represents an RDF literal that was obtained by deserializing a syntactic integer.
Only available in Turtle and TriG
Properties: (inherited from / overriding Literal)
- ... those inherited from Literal
- `.datatype`: `string` = `'http://www.w3.org/2001/XMLSchema#integer'`
Properties:
- `.number`: `number` -- the parsed number value obtained via `parseInt`
- `.isNumeric`: `boolean` = `true`
### **DecimalLiteral** extends Literal
A class that represents an RDF literal that was obtained by deserializing a syntactic decimal.
Only available in Turtle and TriG
Properties: (inherited from / overriding Literal)
- ... those inherited from Literal
- `.datatype`: `string` = `'http://www.w3.org/2001/XMLSchema#decimal'`
Properties:
- `.number`: `number` -- the parsed number value obtained via `parseFloat`
- `.isNumeric`: `boolean` = `true`
### **DoubleLiteral** extends Literal
A class that represents an RDF literal that was obtained by deserializing a syntactic double.
Only available in Turtle and TriG
Properties: (inherited from / overriding Literal)
- ... those inherited from Literal
- `.datatype`: `string` = `'http://www.w3.org/2001/XMLSchema#double'`
Properties:
- `.number`: `number` -- the parsed number value obtained via `parseFloat`
- `.isNumeric`: `boolean` = `true`
Example:
graphy.ttl.parse('<a> <b> 0.42e+2 .', {
data(triple) {
triple.object.value; // '0.42e+2'
triple.object.number; // 42
triple.object.isNumeric; // true
triple.object.datatype; // 'http://www.w3.org/2001/XMLSchema#double'
},
});
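The relationship between a numeric literal's lexical `.value` and its `.number` can be reproduced with plain JavaScript. The helper `numeric_value` below is hypothetical and not part of graphy's API; it simply applies `parseInt` or `parseFloat` according to the datatype, as the property descriptions above suggest.

```js
const P_XSD = 'http://www.w3.org/2001/XMLSchema#';

// hypothetical helper: NOT part of graphy's API
function numeric_value(s_lexical, p_datatype) {
	switch(p_datatype) {
		// syntactic integers go through parseInt (base 10 assumed here)
		case P_XSD+'integer': return parseInt(s_lexical, 10);

		// syntactic decimals and doubles go through parseFloat
		case P_XSD+'decimal':
		case P_XSD+'double': return parseFloat(s_lexical);

		// anything else is not numeric in this sketch
		default: return NaN;
	}
}

console.log(numeric_value('0.42e+2', P_XSD+'double'));
console.log(numeric_value('-7', P_XSD+'integer'));
console.log(numeric_value('3.50', P_XSD+'decimal'));
```

Since JavaScript numbers are double-precision floats, integers beyond `Number.MAX_SAFE_INTEGER` cannot be represented exactly this way; the lexical form in `.value` remains the lossless representation.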
### **BooleanLiteral** extends Literal
A class that represents an RDF literal that was obtained by deserializing a syntactic boolean.
Only available in Turtle and TriG
Properties: (inherited from / overriding Literal)
- ... those inherited from Literal
- `.datatype`: `string` = `'http://www.w3.org/2001/XMLSchema#boolean'`
Properties:
- `.boolean`: `boolean` -- the boolean value, either `true` or `false`
- `.isBoolean`: `boolean` = `true`
### **DefaultGraph** extends Term implements @RDFJS DefaultGraph
A class that represents the default graph by implementing the [RDFJS DefaultGraph interface](https://github.com/rdfjs/representation-task-force/blob/master/interface-spec.md#defaultgraph-extends-term)
Properties: (inherited from Term & implementing RDFJS DefaultGraph interface)
- `.termType`: `string` = `'DefaultGraph'`
- `.value`: `string` = `''` -- always an empty string
Properties:
- `.isDefaultGraph`: `boolean` = `true` -- the preferred and fastest way to test for DefaultGraph term types
### **Quad** implements @RDFJS Quad
A class that represents an RDF triple/quad by implementing the [RDFJS Quad interface](https://github.com/rdfjs/representation-task-force/blob/master/interface-spec.md#quad)
Properties: (implementing RDFJS Quad interface)
- `.subject`: `[NamedNode|BlankNode]`
- `.predicate`: `NamedNode`
- `.object`: `Term`
- `.graph`: `[NamedNode|BlankNode|DefaultGraph]`
Methods: (implementing RDFJS Quad interface)
- `.equals(other: Quad[, ignore_graph: boolean])` -- tests if `other` Quad is equal to this one, optionally ignoring the graph if `ignore_graph` is truthy.
  - returns a `boolean`
- `.toCanonical()` -- produces an N-Triples canonical form of the Quad.
Methods:
- `.valueOf()` -- gets called when cast to a `string`. It simply returns `.toCanonical()`
  - returns a `string`
  - example:
    graphy.quad(
      graphy.namedNode('subject'),
      graphy.namedNode('predicate'),
      graphy.namedNode('object'),
      graphy.namedNode('graph'),
    )+''; // '<subject> <predicate> <object> <graph> .'
### **Triple** aliases Quad implements @RDFJS Triple
A class that represents an RDF triple by implementing the [RDFJS Triple interface](https://github.com/rdfjs/representation-task-force/blob/master/interface-spec.md#quad). Same as `Quad` except that `.graph` will always be a [DefaultGraph](#defaultgraph).
Properties: (aliasing Quad & implementing RDFJS Triple interface)
- `.graph`: `DefaultGraph`
- ... and those in Quad
Methods: (aliasing Quad & implementing RDFJS Triple interface)
- ... those in Quad
Compatibility
Lexing input this fast is only possible by taking advantage of an ECMAScript 2015 feature (the sticky "y" RegExp flag) which is not yet implemented in all browsers, even though it is now the current standard (see the compatibility table at row `RegExp "y" and "u" flags`). It also means that only Node.js versions >= 6.0 are supported, which will also soon be the new LTS anyway. Failure to use a modern engine with graphy will result in:
SyntaxError: Invalid flags supplied to RegExp constructor 'y'
at new RegExp (native)
....
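For readers unfamiliar with the sticky flag, the following sketch shows why it suits lexing: a sticky pattern only matches at exactly `lastIndex`, so the lexer never scans ahead or slices the input. The patterns and the `lex_statement` function are simplified assumptions for illustration and are not graphy's actual lexer.

```js
// a sticky pattern succeeds only if the match starts exactly at lastIndex
const R_IRIREF = /<([^>]*)>\s*/y;
const R_DOT = /\.\s*/y;

// hypothetical mini-lexer for input such as '<a> <b> <c> .'; NOT graphy's code
function lex_statement(s_input) {
	const a_tokens = [];
	let i_pos = 0;

	while(i_pos < s_input.length) {
		// try an IRI reference at the current position
		R_IRIREF.lastIndex = i_pos;
		const m_iri = R_IRIREF.exec(s_input);
		if(m_iri) {
			a_tokens.push({type: 'iri', value: m_iri[1]});
			i_pos = R_IRIREF.lastIndex;
			continue;
		}

		// otherwise try the statement terminator
		R_DOT.lastIndex = i_pos;
		if(R_DOT.test(s_input)) {
			a_tokens.push({type: 'dot'});
			i_pos = R_DOT.lastIndex;
			continue;
		}

		// last branch: nothing matched at this position
		throw new Error(`failed to parse a valid token starting at "${s_input[i_pos]}"`);
	}

	return a_tokens;
}

const a_tokens = lex_statement('<a> <b> <c> .');
console.log(a_tokens);
```

On an engine without support for the flag, the regular expression literals themselves fail to compile, which is the kind of `SyntaxError` shown above.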
Performance
High performance has a cost, namely that this module is not a validator, although it does handle parsing errors. Full validation will likely never be implemented in graphy since it only slows down parsing and because N3.js already does a fine job at it.
#### Parser is intended for valid syntax only
This tool is intended for serialized formats that were generated by a machine. Quite simply, it does not check the contents of certain tokens for "invalid" characters, such as those found inside of IRIs, prefixed names, and blank node labels.
For example:
<a> <iri refs aren't supposed to have spaces> <c> .
Is technically not valid TTL. However, graphy will not emit any errors. Instead, it will emit the following triple:
{
subject: {value: 'a', termType: 'NamedNode', ...},
predicate: {value: "iri refs aren't supposed to have spaces", termType: 'NamedNode', ...},
object: {value: 'c', termType: 'NamedNode', ...},
graph: {value: '', ..}
}
The parser does however handle unexpected token errors that violate syntax. For example:
<a> _:blank_nodes_cannot_be_predicates <c> .
Emits the error:
`_:blank_nodes_cannot_be_predicates `
^
expected pairs. failed to parse a valid token starting at "_"
You can check out the test case in ./test
Debugging
If you are encountering parsing errors or possibly a bug with graphy, simply change your graphy import statement to `const graphy = require('graphy/es6')` to load the unminified, unmangled versions of the parsers, which will yield more verbose parser errors (such as the state name, which gets mangled during minification).
In case you are testing against N-Triples canonicalized forms, bear in mind the following things that graphy does:
- Nested blank node property lists and RDF collections are emitted in the order they appear, rather than from the outside-in.
- Anonymous blank nodes (e.g., `[]`) are assigned a label starting with the character `g`, rather than `b`. This is done in order to minimize the time spent testing and renaming conflicts with common existing blank node labels in the document (such as `_:b0`, `_:b1`, etc.).
Why bother checking for errors at all?
Stumbling into an invalid token does not incur a performance cost since it is the very last branch in a series of if-else jumps. It is mainly the characters inside of expected tokens that are at risk of sneaking invalid characters through. This is due to the fact that the parser uses the simplest regular expressions it can to match tokens, opting for patterns that only exclude characters that can belong to the next token, rather than specifying ranges of valid character inclusion. This compiles DFAs that require far fewer states with fewer instructions, hence less CPU time.
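The trade-off can be seen by comparing two patterns for an IRI reference. Both are illustrative assumptions rather than graphy's actual expressions: the permissive one only excludes the character that ends the token, while the stricter one also rejects the characters that the Turtle `IRIREF` production forbids (simplified here by ignoring `\u` escapes).

```js
// permissive: anything up to the closing '>' is accepted as the IRI's contents
const R_IRIREF_PERMISSIVE = /<([^>]*)>/y;

// stricter: also rejects control characters, spaces and <>"{}|^`\
// (simplified sketch; does not handle \u escape sequences)
const R_IRIREF_STRICT = /<([^\x00-\x20<>"{}|^`\\]*)>/y;

// try to match a token at an exact position; returns the captured contents or null
function match_at(r_token, s_input, i_start) {
	r_token.lastIndex = i_start;
	const m_token = r_token.exec(s_input);
	return m_token? m_token[1]: null;
}

const s_bad = "<iri refs aren't supposed to have spaces>";
const s_good = '<http://ex.org/s>';

const s_permissive_bad = match_at(R_IRIREF_PERMISSIVE, s_bad, 0);
const s_strict_bad = match_at(R_IRIREF_STRICT, s_bad, 0);
const s_strict_good = match_at(R_IRIREF_STRICT, s_good, 0);

console.log(s_permissive_bad, s_strict_bad, s_strict_good);
```

The permissive pattern lets the invalid IRI through, which matches the behaviour described in the Performance section; the stricter pattern rejects it at the cost of a larger character class to test on every character.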
How graphy optimizes
Optimizations are achieved in a variety of ways, but perhaps the most general rule that guides the process is this: graphy tries parsing items based on benefit-cost ratios in descending order. "Benefit" is represented by the probability that a given route is the correct one (derived from typical document frequency), whereas "Cost" is represented by the amount of time it takes to test whether or not a given route is the correct one.
For example, double quoted string literals are more common in TTL documents than single quoted string literals. For this reason, double quoted string literals have a higher benefit, and since testing time for each of these two tokens is identical, they have the same cost. Therefore, if we test for double quoted string literals before single quoted string literals, we end up making fewer tests a majority of the time.
However, the optimization doesn't stop there. We can significantly cut down on the cost of parsing a double quoted string literal if we know it does not contain any escape sequences. String literals without escape sequences are not significantly more common than literals with them, so the benefit is not very high. However, the cost savings are enormous (i.e., the ratio's denominator shrinks), so they outweigh the modest benefit and save time overall.
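The ordering rule can be made concrete with a small calculation. The route names, probabilities and costs below are invented for illustration only; they are not measurements from graphy.

```js
// hypothetical numbers for illustration only; NOT measured from graphy
const a_routes = [
	{name: 'single_quoted_literal', probability: 0.05, cost: 2},
	{name: 'double_quoted_literal', probability: 0.40, cost: 2},
	{name: 'double_quoted_no_escapes', probability: 0.35, cost: 0.5},
	{name: 'numeric_literal', probability: 0.20, cost: 2},
];

// compute each route's benefit-cost ratio, then sort a copy in descending order
const a_ordered = a_routes
	.map(h_route => ({...h_route, ratio: h_route.probability / h_route.cost}))
	.sort((h_a, h_b) => h_b.ratio - h_a.ratio);

const a_order = a_ordered.map(h_route => h_route.name);
console.log(a_order);
```

With these made-up figures, the escape-free route is tried first even though it is not the most probable one, because its low cost gives it the highest ratio.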
License
ISC © Blake Regalia