Efficient streaming element matching and processing for HTML5 DOM serialized HTML. Works with Web Streams as returned by fetch.
Usage
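    'use strict';

    // Reconstructed names: the module is assumed to be loaded under its npm
    // name and to export the reader class as `HTMLTransformReader`.
    const htmlStream = require('web-html-stream');

    /**
     * @param {object} node The matched element (nodeName, attributes,
     *     innerHTML, outerHTML).
     * @param {object} ctx The `ctx` object passed to the reader.
     * @return {object} The value emitted in place of the match in the result
     *     array.
     */
    function handler(node, ctx) {
        // Do something with the node
        return node;
    }

    const testDoc = "<html><body><div>"
        + "<test-element foo='bar'>foo</test-element>"
        + "</div></body>";

    const inputStream = new ReadableStream({
        start(controller) {
            controller.enqueue(testDoc);
            controller.close();
        }
    });

    // Create a matcher to handle some elements, using CSS syntax. To avoid
    // shipping a CSS parser to clients, CSS selectors are only supported in node.
    var reader = new htmlStream.HTMLTransformReader(inputStream, {
        transforms: [
            { selector: 'test-element[foo="bar"]', handler: handler },
            { selector: 'foo-bar', handler: handler }
        ],
        ctx: { hello: 'world' }
    });

    // Create the same matcher using more verbose selector objects. These are
    // especially useful when processing dynamic values, as this avoids the need to
    // escape special chars in CSS selectors.
    // (A stream can only be consumed once; in real code, give each reader a
    // fresh input stream.)
    reader = new htmlStream.HTMLTransformReader(inputStream, {
        transforms: [{
            selector: {
                nodeName: 'test-element',
                attributes: [['foo', '=', 'bar']]
            },
            handler: handler,
            // Optional: Request node.innerHTML / outerHTML as `ReadableStream`
            // instances. Only available in rule objects.
            stream: false
        }],
        ctx: { hello: 'world' }
    });

    // Read matches
    reader.read()
    .then(res => {
        console.log(res);
        // {
        //     done: false,
        //     value: [
        //         "<html><body><div>",
        //         {
        //             "nodeName": "test-element",
        //             "attributes": {
        //                 "foo": "bar"
        //             },
        //             "outerHTML": "<test-element foo='bar'>foo</test-element>",
        //             "innerHTML": "foo"
        //         },
        //         "</div></body>"
        //     ]
        // }
        return reader.read();
    })
    .then(res => {
        console.log(res);
        // { done: true, value: undefined }
    });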
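The reader yields arrays that mix plain strings with matched-node objects, so a consumer usually has to drain it and decide what to do with each item. The following is a minimal sketch of that loop, not part of the library: `readAllAsHTML` and `makeFakeReader` are hypothetical helpers, and the fake reader is a plain `ReadableStream` (global in browsers and in Node 18+) that emits one chunk shaped like the sample output above. It assumes matches carry `outerHTML` as a string, that is, `stream: false`.

```javascript
'use strict';

// Drain any reader whose read() resolves to { done, value }, where value is
// an array of strings and node objects, and join everything back into HTML.
async function readAllAsHTML(reader) {
    let html = '';
    for (;;) {
        const { done, value } = await reader.read();
        if (done) {
            return html;
        }
        for (const item of value) {
            // Unmatched markup passes through as strings; matched elements
            // arrive as objects, serialized here via their outerHTML.
            html += typeof item === 'string' ? item : item.outerHTML;
        }
    }
}

// Hypothetical stand-in for a transform reader, used only to keep the
// example self-contained.
function makeFakeReader() {
    return new ReadableStream({
        start(controller) {
            controller.enqueue([
                '<html><body><div>',
                {
                    nodeName: 'test-element',
                    attributes: { foo: 'bar' },
                    outerHTML: "<test-element foo='bar'>foo</test-element>",
                    innerHTML: 'foo'
                },
                '</div></body>'
            ]);
            controller.close();
        }
    }).getReader();
}

readAllAsHTML(makeFakeReader()).then(html => console.log(html));
// prints: <html><body><div><test-element foo='bar'>foo</test-element></div></body>
```

Because the loop only relies on `read()` and the `{ done, value }` result shape, the same helper should work with any reader that follows the Web Streams reader contract.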
Performance
Using the Barack Obama article (1.5 MB of HTML, part of npm test):
- web-html-stream match & replace all 32 <figure> elements: 1.95 ms
- web-html-stream match & replace all links: 14.98 ms
- web-html-stream match & replace a specific link (a[href="./Riverdale,_Chicago"]): 2.24 ms
- web-html-stream match & replace references section (ol[typeof="mw:Extension/references"]): 3.7 ms
- libxml DOM parse: 26.3 ms
- libxml DOM round-trip: 50.8 ms
- htmlparser2 DOM parse: 66.8 ms
- htmlparser2 DOM round-trip: 99.7 ms
- htmlparser2 SAX parse: 70.6 ms
- domino DOM parse: 225.8 ms
- domino DOM round-trip: 248.6 ms
Using a smaller (1.1 MB) version of the same page:
- SAX parse via libxmljs (node) and no-op handlers: 64 ms
- XML DOM parse via libxmljs (node): 16 ms
  - XPATH match for ID (ex: dom.find('//*[@id = "mw123"]')): 15 ms
  - XPATH match for class (ex: dom.find("//*[contains(concat(' ', normalize-space(@class), ' '), ' interlanguage-link ')]")): 34 ms
- HTML5 DOM parse via Mozilla's html5ever: 32 ms
  - full round-trip with serialization: 60 ms
- HTML5 DOM parse via domino (node): 220 ms
Syntactical requirements
web-html-stream gets much of its efficiency from leveraging the syntactic regularity of HTML5 and XMLSerializer DOM serialization.
Detailed requirements (all true for HTML5 and XMLSerializer output):
- Well-formed DOM: Handled tags are balanced.
- Quoted attributes: All attribute values are quoted using single or double quotes.
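To see why these two requirements matter, here is a rough sketch of parser-free element matching. It is not web-html-stream's implementation; `parseAttributes` and `findElement` are hypothetical names, and the sketch ignores comments, script content and tag names that need regexp escaping. Quoted attribute values let a single regular expression recognize a whole opening tag (even when a value contains `>`), and balanced tags let a simple depth counter find the matching close tag.

```javascript
'use strict';

// Quoted attributes: every value is delimited by ' or ", so one regexp can
// extract name/value pairs without tracking tokenizer state.
function parseAttributes(attrSource) {
    const attrs = {};
    const re = /\s([^\s=\/>]+)=(?:"([^"]*)"|'([^']*)')/g;
    let m;
    while ((m = re.exec(attrSource)) !== null) {
        attrs[m[1]] = m[2] !== undefined ? m[2] : m[3];
    }
    return attrs;
}

// Balanced tags: count opening and closing tags of the same name until the
// depth returns to zero. Returns the first match, or null if there is none.
function findElement(html, nodeName) {
    const tag = new RegExp(
        '<(/?)' + nodeName
        + '((?:\\s+[^\\s=/>]+=(?:"[^"]*"|\'[^\']*\'))*)\\s*>', 'g');
    let depth = 0;
    let start = -1;
    let innerStart = -1;
    let attributes = null;
    let m;
    while ((m = tag.exec(html)) !== null) {
        if (m[1] === '') {
            // Opening tag: remember where the outermost match begins.
            if (depth === 0) {
                start = m.index;
                innerStart = tag.lastIndex;
                attributes = parseAttributes(m[2]);
            }
            depth++;
        } else if (depth > 0) {
            depth--;
            if (depth === 0) {
                return {
                    nodeName: nodeName,
                    attributes: attributes,
                    outerHTML: html.slice(start, tag.lastIndex),
                    innerHTML: html.slice(innerStart, m.index)
                };
            }
        }
    }
    return null;
}

const sample = "<html><body><div>"
    + "<test-element foo='bar'>foo</test-element>"
    + "</div></body>";
const match = findElement(sample, 'test-element');
console.log(match.attributes, match.innerHTML);
```

With unquoted attribute values or unbalanced tags, neither the single-regexp tag match nor the depth counter would be reliable, and a full HTML5 tokenizer would be needed instead.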