tagstream wraps node-expat in a through-stream so you can perform XML parsing as part of a pipeline using event-stream's various fun operators, e.g. `map` and `join`, and your own through-streams.
My use case was accepting terabytes of XML over fast networks, extracting objects, filtering them, putting them back into a text-based format, then feeding them to clients over slow networks. Proper backpressure handling was compulsory.
## Alternatives
- cheerio can traverse your XML and hand you objects to deal with, if your XML fits in memory.
- sax's `createStream` lets you stream XML text in and streams the same XML text back out. Meanwhile, it emits the events tagstream would emit as `data`. If you adapted sax to do what tagstream does and published the adaptation as a library, it would be pretty much exactly tagstream, but with sax instead of node-expat.
- JSONStream is somewhat along these lines, but its input is JSON and its output is, sensibly, objects parsed out of the JSON stream. tagstream's input is XML and its output is XML parser events.
## Usage
Drop a `tagstream` between your XML text and whatever needs to deal with the events the XML parser is throwing.
```javascript
var request = require('request'),
    es = require('event-stream'),
    tagstream = require('tagstream');

es.pipeline(
  request(urlOfBigXML),
  tagstream(),
  // through-streams transforming events into whatever...
  output
);
```
See `test/demo.js` for one example, which extracts the text from between XML tags. If you replace `input` with a fast stream and `output` with a slow stream, you can feed in 1TB of XML and watch the output stream control how fast the input stream is read.
## Data Object Detail
Each datum carried by the `data` event out of the tag stream is an object. Depending on the XML event, it'll be one of these:
- `{ what: "start", tag: tagName, attrs: { ... } }`
- `{ what: "end", tag: tagName }`
- `{ what: "text", text: text }`
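Dispatching on `what` is all a consumer needs to do. As a sketch of turning these events back into a text-based format (the use case mentioned above), here's a hypothetical serializer; it's not part of tagstream, and it ignores attributes for brevity:

```javascript
// Render one tagstream event back to XML-ish text.
// Assumes the three shapes documented above; attrs are ignored.
function describeEvent(ev) {
  switch (ev.what) {
    case 'start':
      return '<' + ev.tag + '>';
    case 'end':
      return '</' + ev.tag + '>';
    case 'text':
      return ev.text;
  }
}
```

You could wrap this in an object-mode through-stream and place it at the end of the pipeline to emit text again.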
## Testing
`npm install` to ensure `tap` is present, then `npm test`.