wiki-transform
Stream transforming raw XML into wiki pages
wiki-transform provides a WikiTransform hybrid stream for NodeJS: it takes XML chunks and outputs WikiPage objects.
It is an extremely fast stream, because it internally uses a SAX parser combined with a hyper-minimalist algorithm.
Last but not least, WikiTransform is a standard stream, so you can use it in pipelines, or you can manually control it via the usual stream methods.
Installation
npm install @giancosta86/wiki-transform
or
yarn add @giancosta86/wiki-transform
The public API entirely resides in the root package index, so you shouldn't reference specific modules.
Usage
Just create a new instance of WikiTransform - maybe passing options. You will then be able to:
- add it to a pipeline - via a chain of .pipe() method calls, or via the pipeline() function provided by NodeJS
- call its standard methods - like .write(), .end(), .on() and .once()
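As a minimal sketch of the second option, the helper below drives a stream by hand with .write(), .end(), .on() and .once(). The name collectPages is hypothetical and not part of the package. It accepts any Duplex so that it stays independent of the library. Since WikiTransform is described as a standard stream, an instance of it should be usable as the argument.

```typescript
import { Duplex } from "node:stream";

// Hypothetical helper (not exported by wiki-transform): writes every XML
// chunk to the given stream, ends it, and resolves with all emitted objects.
export function collectPages(
  stream: Duplex,
  xmlChunks: readonly string[]
): Promise<unknown[]> {
  return new Promise((resolve, reject) => {
    const pages: unknown[] = [];

    // Attaching a "data" listener switches the readable side to flowing mode
    stream.on("data", (page) => pages.push(page));
    stream.once("end", () => resolve(pages));
    stream.once("error", reject);

    // Backpressure is ignored here for brevity; for large inputs,
    // pipeline() is the safer choice because it handles it for you
    for (const chunk of xmlChunks) {
      stream.write(chunk);
    }
    stream.end();
  });
}

// Assumed usage, with the package installed:
//   const pages = await collectPages(new WikiTransform(), ["<page>...", "...</page>"]);
```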
Supported format
WikiTransform will create a WikiPage object whenever it encounters the following XML pattern:
<page>
  <title>The title</title>
  <text>The text</text>
</page>
with the following rules:
- The order of the subfields is ignored
- Additional subfields are ignored
- Ancestor nodes are ignored
- Whitespace is ignored
- XML entities like &gt; are substituted with their actual characters
- CDATA blocks within significant fields are correctly parsed, and can be freely mixed with non-CDATA text
- In lieu of <page>, the root tag can be something else - just pass the related opening tag (without angle brackets) to the pageTag constructor option
Please note: this library does NOT support nested tags within the <text> element! To handle them, you should instead rely on dedicated SAX parsing.
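To make these rules concrete, here is a small sketch pairing a sample input with the page that the rules suggest should come out of it. The WikiPageSketch interface is a local stand-in: it assumes that WikiPage exposes title and text string fields, which this document does not confirm. The expected value is derived from the rules listed above, not from running the library.

```typescript
// Assumption: WikiPage has `title` and `text` string fields;
// this interface is only a local stand-in for the library's type.
interface WikiPageSketch {
  title: string;
  text: string;
}

// Sample input exercising several rules at once: an ancestor node,
// an additional subfield, swapped field order, an XML entity,
// and a CDATA block mixed with non-CDATA text.
export const sampleXml = `
<mediawiki>
  <page>
    <id>90</id>
    <text>3&gt;2,<![CDATA[<b>bold</b>]]>!</text>
    <title>Sample title</title>
  </page>
</mediawiki>`;

// What the documented rules suggest for the input above: the entity
// becomes ">" and the CDATA content is kept as plain characters.
export const expectedPage: WikiPageSketch = {
  title: "Sample title",
  text: "3>2,<b>bold</b>!"
};
```

Note that the markup inside the CDATA block is plain character data, so it does not count as a nested tag within the text element.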
Example
This basic but fairly general-purpose function:
- extracts wiki pages from any source stream actually generating XML chunks - for example, an HTTP connection, or a file
- outputs such WikiPage objects to the given target stream
import { Readable, Writable } from "node:stream";
import { pipeline } from "node:stream/promises";
import { WikiTransform } from "@giancosta86/wiki-transform";

export async function extractWikiPages(
  source: Readable,
  target: Writable
): Promise<void> {
  const wikiTransform = new WikiTransform();

  return pipeline(source, wikiTransform, target);
}
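The target stream receives objects rather than strings or buffers, so it has to be created in object mode. The sketch below shows one possible target: a Writable that simply stores whatever it receives. The name createCollectingTarget is hypothetical and not part of the package.

```typescript
import { Writable } from "node:stream";

// Hypothetical helper: builds an object-mode Writable that appends every
// received object to the given array. Without objectMode, a Writable only
// accepts strings, Buffers and typed arrays.
export function createCollectingTarget<T>(collected: T[]): Writable {
  return new Writable({
    objectMode: true,
    write(page: T, _encoding, callback) {
      collected.push(page);
      callback();
    }
  });
}

// Assumed usage with the function above (createReadStream comes from node:fs,
// and "dump.xml" is a placeholder path):
//   const pages: WikiPage[] = [];
//   await extractWikiPages(createReadStream("dump.xml"), createCollectingTarget(pages));
```

Collecting everything in memory is only reasonable for small inputs or tests; for large dumps, a target that processes each page as it arrives is a better fit.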
Constructor parameters
- pageTag: if present, defines the tag opening each page, without angle brackets. Default: "page"
- logger: a Logger interface, as exported by unified-logging. Default: no logger
- highWaterMark: if present, passed to the base constructor
- signal: if present, passed to the base constructor
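As an illustration, the options above might be assembled as follows. The WikiTransformOptionsSketch interface is a hypothetical local mirror of the documented parameters, and the assumption that the constructor takes a single options object comes from the wording of this document, not from the library's type definitions. The logger option is left out because it depends on unified-logging.

```typescript
// Hypothetical local mirror of the documented options; the library's
// actual options type may be named and shaped differently.
interface WikiTransformOptionsSketch {
  pageTag?: string;
  highWaterMark?: number;
  signal?: AbortSignal;
}

// Lets the caller cancel the stream from outside
export const abortController = new AbortController();

export const options: WikiTransformOptionsSketch = {
  // Match <article>...</article> blocks instead of <page>...</page>
  pageTag: "article",
  // Passed to the base stream constructor
  highWaterMark: 32,
  // Also passed to the base constructor; for standard NodeJS streams,
  // aborting the signal destroys the stream
  signal: abortController.signal
};

// Assumed usage, with the package installed:
//   const wikiTransform = new WikiTransform(options);
//   abortController.abort(); // cancels processing early
```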
Additional notes
As a convenience utility, especially for testing, the package also provides a wikiPageToXml() function, which converts a WikiPage to XML - using a CDATA block in every field.
Further reference
For additional examples, please consult the unit tests in the source code repository.