elematch
Efficient element matching and processing for HTML5 DOM serialized HTML.
Usage
    var EleMatch = require('elematch');

    // Note: the call shapes used below (new EleMatch(rules, options) and
    // matcher.match(doc)) are assumed; verify them against the installed version.

    /**
     * @param {object} node A DOM-node-like object with nodeName, attributes,
     *   innerHTML and outerHTML properties.
     * @return {*} Any value; return values are accumulated in the `values`
     *   array.
     */
    function handler(node) {
        // Do something with the node
        return node;
    }

    // Create a matcher to handle some elements, using CSS syntax. To avoid
    // shipping a CSS parser to clients, CSS selectors are only supported in node.
    var matcher = new EleMatch({
        'test-element[foo="bar"]': handler,
        'foo-bar': handler
    }, {
        ctx: { hello: 'world' }
    });

    // Create the same matcher using more powerful rule objects. These are
    // supported in node & the client, and offer full functionality.
    var matcher = new EleMatch([{
        selector: {
            nodeName: 'test-element',
            attributes: [{ name: 'foo', operator: '=', value: 'bar' }]
        },
        handler: handler,
        // Optional: Request node.innerHTML / outerHTML as `ReadableStream`
        // instances. Only available in rule objects.
        stream: false
    }, {
        selector: { nodeName: 'foo-bar' },
        handler: handler
    }], {
        ctx: { hello: 'world' }
    });

    var testDoc = "<html><body><div>"
        + "<test-element foo='bar'>foo</test-element>"
        + "</div></body>";

    // Finally, execute it all.
    var match = matcher.match(testDoc);
    console.log(JSON.stringify(match, null, 2));
    // {
    //   done: true,
    //   values: [
    //     "<html><body><div>",
    //     {
    //       "nodeName": "test-element",
    //       "attributes": {
    //         "foo": "bar"
    //       },
    //       "outerHTML": "<test-element foo='bar'>foo</test-element>",
    //       "innerHTML": "foo"
    //     },
    //     "</div></body>"
    //   ]
    // }
Performance
Using the Barack Obama article (1.5 MB HTML, part of npm test):
- elematch match & replace all 32 <figure> elements: 1.95ms
- elematch match & replace all links: 14.98ms
- elematch match & replace a specific link (a[href="./Riverdale,_Chicago"]): 2.24ms
- elematch match & replace references section (ol[typeof="mw:Extension/references"]): 3.7ms
- libxml DOM parse: 26.3ms
- libxml DOM round-trip: 50.8ms
- htmlparser2 DOM parse: 66.8ms
- htmlparser2 DOM round-trip: 99.7ms
- htmlparser2 SAX parse: 70.6ms
- domino DOM parse: 225.8ms
- domino DOM round-trip: 248.6ms
Using a smaller (1.1 MB) version of the same page:
- SAX parse via libxmljs (node) and no-op handlers: 64ms
- XML DOM parse via libxmljs (node): 16ms
  - XPATH match for ID (ex: dom.find('//*[@id = "mw123"]')): 15ms
  - XPATH match for class (ex: dom.find("//*[contains(concat(' ', normalize-space(@class), ' '), ' interlanguage-link ')]")): 34ms
- HTML5 DOM parse via Mozilla's html5ever: 32ms
  - full round-trip with serialization: 60ms
- HTML5 DOM parse via domino (node): 220ms
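
Timings like the ones above can be reproduced with a small wall-clock harness. The following is only a sketch of one way to measure them in node, not the benchmark code shipped with the package: `timeIt` is a hypothetical helper, and the `indexOf` workload is a stand-in for a real parse or match call.

```javascript
// Hypothetical helper: mean wall-clock time of fn() in milliseconds.
function timeIt(fn, iterations) {
    // A few warm-up runs so that JIT compilation does not dominate the result.
    for (let i = 0; i < 5; i++) {
        fn();
    }
    const start = process.hrtime.bigint();
    for (let i = 0; i < iterations; i++) {
        fn();
    }
    const elapsedNs = process.hrtime.bigint() - start;
    // Nanoseconds -> milliseconds, averaged over all iterations.
    return Number(elapsedNs) / 1e6 / iterations;
}

// Stand-in workload: scan a large string for a closing tag.
const sampleDoc = '<p>' + 'x'.repeat(100000) + '</p>';
const meanMs = timeIt(() => sampleDoc.indexOf('</p>'), 100);
console.log('indexOf: ' + meanMs.toFixed(3) + 'ms per run');
```

Averaging over many iterations after a warm-up matters at this scale, since single-digit millisecond results are easily distorted by one-off compilation or garbage collection pauses.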
Syntactical requirements
elematch gets much of its efficiency from leveraging the syntactic regularity of HTML5 and XMLSerializer DOM serialization.
Detailed requirements (all true for HTML5 and XMLSerializer output):
- Well-formed DOM: Handled tags are balanced.
- Quoted attributes: All attribute values are quoted using single or double quotes.
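
To illustrate why these two requirements help, here is a minimal sketch of regex-based element extraction that depends on exactly them. It is not elematch's actual implementation: `splitOnElement` and `parseAttributes` are hypothetical helpers, `nodeName` is assumed to be a plain tag name of a non-void element, and attribute values are returned without entity decoding.

```javascript
// Hypothetical helper: parse quoted attributes from the inside of a tag.
function parseAttributes(src) {
    // Safe only because every value is wrapped in single or double quotes.
    const attrRe = /([^\s=\/]+)\s*=\s*(?:"([^"]*)"|'([^']*)')/g;
    const attributes = {};
    let m;
    while ((m = attrRe.exec(src)) !== null) {
        attributes[m[1]] = m[2] !== undefined ? m[2] : m[3];
    }
    return attributes;
}

// Hypothetical helper: split html into plain strings and matched elements.
function splitOnElement(html, nodeName) {
    // Quoted values are consumed whole, so a '>' inside a value cannot end
    // the tag early.
    const attrs = '((?:[^>"\']|"[^"]*"|\'[^\']*\')*)';
    const tagRe = new RegExp('<(/?)' + nodeName + '(?=[\\s/>])' + attrs + '>', 'g');
    const values = [];
    let depth = 0;      // nesting depth of nodeName elements
    let textStart = 0;  // start of text not yet emitted
    let open = null;    // regex match for the outermost opening tag
    let m;
    while ((m = tagRe.exec(html)) !== null) {
        if (m[1] === '') {
            // Opening tag: remember the outermost one.
            if (depth === 0) {
                open = m;
            }
            depth++;
        } else if (depth > 0 && --depth === 0) {
            // Balanced tags: this close belongs to the outermost open tag.
            const end = m.index + m[0].length;
            if (open.index > textStart) {
                values.push(html.slice(textStart, open.index));
            }
            values.push({
                nodeName: nodeName,
                attributes: parseAttributes(open[2]),
                outerHTML: html.slice(open.index, end),
                innerHTML: html.slice(open.index + open[0].length, m.index)
            });
            textStart = end;
        }
    }
    if (textStart < html.length) {
        values.push(html.slice(textStart));
    }
    return values;
}

const sample = "<html><body><div>"
    + "<test-element foo='bar'>foo</test-element>"
    + "</div></body>";
console.log(JSON.stringify(splitOnElement(sample, 'test-element'), null, 2));
```

Only tags of the requested element are ever inspected; everything else is passed through as opaque text. Without the quoting and balancing guarantees, neither the tag-end detection nor the depth counter would be reliable.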