sanitize-dom

Recursive sanitizer/filter to manipulate live WHATWG DOMs rather than HTML, for the browser and Node.js.

Rationale

Direct DOM manipulation has gotten a bad reputation in the last decade of web development. From Ruby on Rails to React, the DOM was seen as something to gloriously destroy and re-render, from the server or even from within the browser. Never mind that the browser had already spent a lot of effort parsing HTML and constructing this tree! Mind-numbingly complex regular expression tests and string manipulations had to deal with low-level details of HTML syntax to insert, delete and change elements, sometimes on every keystroke! In contrast, DOM functions like createElement, remove and insertBefore were largely unknown and unused, except perhaps in jQuery.

Processing HTML is destructive: the original DOM is destroyed and garbage collected after some delay. Attached event handlers are detached and garbage collected. A completely new DOM is created by parsing the new HTML set via .innerHTML =. Event listeners have to be re-attached from userland (this is not an issue when using on* HTML attributes, but those have disadvantages of their own).

It doesn't have to be this way. Do not eliminate, but manipulate!

Save the (DOM) trees!

sanitize-dom crawls a DOM subtree (beginning from a given node, all the way down to its descendant leaves) and filters and manipulates it non-destructively. This is very efficient: the browser doesn't have to re-render everything; it only re-renders what has changed (sound familiar from React?).

The benefits of direct DOM manipulation:

Nodes stay alive.
References to nodes (e.g. stored in a Map or WeakMap) stay alive.
Already attached event handlers stay alive.
The browser doesn't have to re-render entire sections of a page; thus no flickering, no scroll jumping, no big CPU spikes.
CPU cycles for repeatedly parsing and dumping of HTML are eliminated.

sanitize-dom's further advantages:

No dependencies.
Small footprint (only about 7 kB minimized).
Faster than other HTML sanitizers because there is no HTML parsing and serialization.

Use cases

Aside from the browser, sanitize-dom can also be used in Node.js by supplying WHATWG DOM implementations like jsdom.

The test file describes additional usage patterns and features.

For the usage examples below, I'll use sanitizeHtml just to be able to illustrate the HTML output.

By default, all tags are 'flattened', i.e. only their inner text is kept:

sanitizeHtml(document, '<div><p>abc <b>def</b></p></div>');
"abc def"

Selective joining of same-tag siblings:

// Joins the two I tags.
sanitizeHtml(document, '<i>Hello</i> <i>world!</i> <em>Goodbye</em> <em>world!</em>', {
  allow_tags_deep: { '.*': '.*' },
  join_siblings: ['I'],
});
"<i>Hello world!</i> <em>Goodbye</em> <em>world!</em>"
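
To picture what the joining rule does, here is a minimal sketch on plain-object stand-ins for DOM nodes (strings play the role of text nodes). It is not the library's implementation; the helpers `toHtml`, `isBlankText`, `joinSiblings` and `el` are hypothetical names made up for this illustration.

```javascript
// Plain-object stand-ins: strings are text nodes, objects are elements.
const toHtml = (n) =>
  typeof n === 'string'
    ? n
    : `<${n.tag.toLowerCase()}>${n.children.map(toHtml).join('')}</${n.tag.toLowerCase()}>`;

const isBlankText = (n) => typeof n === 'string' && n.trim() === '';

// Builds a new child list in which same-tag siblings listed in `tags` are
// merged (into the earlier sibling, in place) when only whitespace text
// separates them.
function joinSiblings(children, tags) {
  const out = [];
  for (const child of children) {
    let i = out.length - 1;
    while (i >= 0 && isBlankText(out[i])) i--; // skip the whitespace gap
    const prev = out[i];
    const joinable =
      typeof child !== 'string' &&
      prev !== undefined &&
      typeof prev !== 'string' &&
      prev.tag === child.tag &&
      tags.includes(child.tag);
    if (joinable) {
      const gap = out.splice(i + 1); // whitespace between the two siblings
      prev.children.push(...gap, ...child.children);
    } else {
      out.push(child);
    }
  }
  return out;
}

const el = (tag, ...children) => ({ tag, children });
const joined = joinSiblings(
  [el('I', 'Hello'), ' ', el('I', 'world!'), ' ', el('EM', 'Goodbye'), ' ', el('EM', 'world!')],
  ['I']
);
console.log(joined.map(toHtml).join(''));
// <i>Hello world!</i> <em>Goodbye</em> <em>world!</em>
```

The EM siblings stay separate because only 'I' is listed, which mirrors the example above.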

Removal of redundant nested nodes (ubiquitous when using a WYSIWYG contenteditable editor):

sanitizeHtml(document, '<i><i>H<i></i>ello</i> <i>world! <i>Good<i>bye</i></i> world!</i>', {
  allow_tags_deep: { '.*': '.*' },
  flatten_tags_deep: { i: 'i' },
});
"<i>Hello  world! Goodbye world!</i>"

Remove redundant empty tags:

sanitizeHtml(document, 'H<i></i>ello world!', {
  allow_tags_deep: { '.*': '.*' },
  remove_empty: true,
});
"Hello world!"

By default, all classes and attributes are removed:

// Keep all nodes, but remove all of their attributes and classes:
sanitizeHtml(document, '<div><p>abc <b class="green" data-type="test">def</b></p></div>', {
  allow_tags_deep: { '.*': '.*' },
});
"<div><p>abc <b>def</b></p></div>"

Keep all nodes and all their attributes and classes:

sanitizeHtml(document, '<div><p class="red green">abc <b class="green" data-type="test">def</b></p></div>', {
  allow_tags_deep: { '.*': '.*' },
  allow_attributes_by_tag: { '.*': '.*' },
  allow_classes_by_tag: { '.*': '.*' },
});
'<div><p class="red green">abc <b class="green" data-type="test">def</b></p></div>'

White-listing of classes and attributes:

// Keep only data- attributes and 'green' classes
sanitizeHtml(document, '<div><p class="red green">abc <b class="green" data-type="test">def</b></p></div>', {
  allow_tags_deep: { '.*': '.*' },
  allow_attributes_by_tag: { '.*': 'data-.*' },
  allow_classes_by_tag: { '.*': 'green' },
});
'<div><p class="green">abc <b class="green" data-type="test">def</b></p></div>'
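
The whitelisting above can be pictured as a filter over attribute or class names. The sketch below is only an illustration: `keepNames` is a hypothetical helper, and it assumes that name patterns are compiled case-insensitively like tag patterns, which the documentation only states for tag names.

```javascript
// Hypothetical helper: keep the names that match any pattern listed under
// a key whose own pattern matches the tag name.
function keepNames(spec, tagname, names) {
  const patterns = Object.entries(spec)
    .filter(([tagRe]) => new RegExp(tagRe, 'i').test(tagname))
    .flatMap(([, nameRes]) => [].concat(nameRes)); // accept string or array
  return names.filter((name) =>
    patterns.some((re) => new RegExp(re, 'i').test(name))
  );
}

console.log(keepNames({ '.*': 'green' }, 'P', ['red', 'green']));
// [ 'green' ]
console.log(keepNames({ '.*': 'data-.*' }, 'B', ['data-type', 'title']));
// [ 'data-type' ]
```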

White-listing of node tags to keep:

// Keep only B tags anywhere in the document.
sanitizeHtml(document, '<i>abc</i> <b>def</b> <em>ghi</em>', {
  allow_tags_deep: { '.*': '^b$' },
});
"abc <b>def</b> ghi"
 
// Keep only DIV children of BODY and I children of DIV.
sanitizeHtml(document, '<div> <i>abc</i> <em>def</em></div> <i>ghi</i>', {
  allow_tags_direct: {
    body: 'div',
    div: '^i',
  },
});
"<div> <i>abc</i> def</div> ghi"

Selective flattening of nodes:

// Flatten only EM children of DIV.
sanitizeHtml(document, '<div> <i>abc</i> <em>def</em></div> <i>ghi</i>', {
  allow_tags_deep: { '.*': '.*' },
  flatten_tags_direct: {
    div: 'em',
  },
});
"<div> <i>abc</i> def</div> <i>ghi</i>"
 
// Flatten I tags anywhere in the document.
sanitizeHtml(document, '<div> <i>abc</i> <em>def</em></div> <i>ghi</i>', {
  allow_tags_deep: { '.*': '.*' },
  flatten_tags_deep: {
    '.*': '^i',
  },
});
"<div> abc <em>def</em></div> ghi"

Selective removal of tags:

// Remove I children of DIVs.
sanitizeHtml(document, '<div> <i>abc</i> <em>def</em></div> <i>ghi</i>', {
  allow_tags_deep: { '.*': '.*' },
  remove_tags_direct: {
    'div': 'i',
  },
});
"<div>  <em>def</em></div> <i>ghi</i>"

Sometimes there is more than one way to accomplish the same thing, as shown in this advanced example:

// Keep all tags except B, anywhere in the document. Two different solutions:
 
sanitizeHtml(document, '<div> <i>abc</i> <b>def</b> <em>ghi</em> </div>', {
  allow_tags_deep: { '.*': '.*' },
  flatten_tags_deep: { '.*': 'B' },
});
"<div> <i>abc</i> def <em>ghi</em> </div>"
 
sanitizeHtml(document, '<div> <i>abc</i> <b>def</b> <em>ghi</em> </div>', {
  allow_tags_deep: { '.*': '^((?!b).)*$' }
});
"<div> <i>abc</i> def <em>ghi</em> </div>"
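
Spec strings are compiled with new RegExp(regex, 'i') (see the Regex typedef), so it is worth checking what a pattern really matches. The sketch below exercises only the regular expressions, not sanitize-dom itself. `tagMatches` is a hypothetical helper, and it assumes the compiled expression is applied as an unanchored test, which the documentation does not state.

```javascript
// Hypothetical helper: compile a spec string case-insensitively and test it
// against a tag name.
const tagMatches = (regex, tagname) => new RegExp(regex, 'i').test(tagname);

// The lookahead pattern from the example rejects every name that contains
// a "b" anywhere, not just the B tag itself.
const noB = '^((?!b).)*$';
console.log(tagMatches(noB, 'B'));     // false
console.log(tagMatches(noB, 'EM'));    // true
console.log(tagMatches(noB, 'TABLE')); // false

// A stricter alternative that excludes only the exact name "b".
const notExactlyB = '^(?!b$).*$';
console.log(tagMatches(notExactlyB, 'B'));     // false
console.log(tagMatches(notExactlyB, 'TABLE')); // true

// Patterns without a trailing anchor match prefixes: '^i' also matches IMG.
console.log(tagMatches('^i', 'IMG'));  // true
console.log(tagMatches('^i$', 'IMG')); // false
```

Under that assumption, anchoring patterns with ^ and $ is the safer habit whenever a single tag is meant.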

And finally, filter functions allow ultimate flexibility:

// change B node to EM node with contextual inner text; attach an event listener.
sanitizeHtml(document, '<p>abc <i><b>def</b> <b>ghi</b></i></p>', {
  allow_tags_direct: {
    '.*': '.*',
  },
  filters_by_tag: {
    B: [
      function changesToEm(node, { parentNodes, parentNodenames, siblingIndex }) {
        const em = document.createElement('em');
        const text = `${parentNodenames.join(', ')} - ${siblingIndex}`;
        em.innerHTML = text;
        em.addEventListener('click', () => alert(text));
        return em;
      },
    ],
  },
});
// In a browser, the EM tags would be clickable and an alert box would pop up.
"<p>abc <i><em>I, P, BODY - 0</em> <em>I, P, BODY - 2</em></i></p>"

Tests

Run in Node.js:

npm test

For the browser, run:

cd sanitize-dom
npm i -g jspm@2.0.0-beta.7 http-server
jspm install @jspm/core@1.1.0
http-server

Then, in a browser which supports <script type="importmap"></script> (e.g. Google Chrome version >= 81), browse to http://127.0.0.1:8080/test

API Reference

Functions

sanitizeNode(doc, node, [opts], [nodePropertyMap])

Simple wrapper for sanitizeDom. Processes the node and its childNodes recursively.

sanitizeChildNodes(doc, node, [opts], [nodePropertyMap])

Simple wrapper for sanitizeDom. Processes only the node's childNodes recursively, but not the node itself.

sanitizeHtml(doc, html, [opts], [isDocument], [nodePropertyMap]) ⇒ String

Simple wrapper for sanitizeDom. Instead of a DomNode, it takes an HTML string.

sanitizeDom(doc, contextNode, [opts], [childrenOnly], [nodePropertyMap])

This function is not exported: Please use the wrapper functions instead:

sanitizeHtml, sanitizeNode, and sanitizeChildNodes.

Recursively processes a tree with node at the root.

In all descriptions, the term "flatten" means that a node is replaced with the node's childNodes. For example, if the B node in <i>abc<b>def<u>ghi</u></b></i> is flattened, the result is <i>abcdef<u>ghi</u></i>.
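
As a concrete picture of flattening, here is a minimal sketch on plain-object stand-ins for DOM nodes (strings play the role of text nodes). `serialize` and `flattenChild` are hypothetical helpers for this illustration and say nothing about how the library implements the step. With real DOM nodes, node.replaceWith(...node.childNodes) has the same effect.

```javascript
// Plain-object stand-ins: strings are text nodes, objects are elements.
const serialize = (n) =>
  typeof n === 'string'
    ? n
    : `<${n.tag}>${n.children.map(serialize).join('')}</${n.tag}>`;

// Replace the child at `index` with that child's own children.
function flattenChild(parent, index) {
  const child = parent.children[index];
  parent.children.splice(index, 1, ...child.children);
}

const b = { tag: 'b', children: ['def', { tag: 'u', children: ['ghi'] }] };
const root = { tag: 'i', children: ['abc', b] };

console.log(serialize(root)); // <i>abc<b>def<u>ghi</u></b></i>
flattenChild(root, 1);
console.log(serialize(root)); // <i>abcdef<u>ghi</u></i>
```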

Each node is processed in the following sequence:

1. Filters matching the opts.filters_by_tag spec are called. If a filter returns null, the node is removed and processing stops (see filters).
2. If the opts.remove_tags_* spec matches, the node is removed and processing stops.
3. If the opts.flatten_tags_* spec matches, the node is flattened and processing stops.
4. If the opts.allow_tags_* spec matches:
   - All attributes not matching opts.allow_attributes_by_tag are removed.
   - All class names not matching opts.allow_classes_by_tag are removed.
   - The node is kept and processing stops.
5. Otherwise, the node is flattened.
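
The order of these checks can be sketched as a small decision function. This is an illustration under stated assumptions, not the library's code: `reTest`, `specMatches` and `decide` are hypothetical names, filters and default option values are left out, and patterns are assumed to be applied as unanchored, case-insensitive tests.

```javascript
const toArray = (v) => (Array.isArray(v) ? v : [v]);
const reTest = (regex, name) => new RegExp(regex, 'i').test(name);

// True if some parent name matches a spec key and `tag` matches one of the
// patterns listed under that key.
function specMatches(spec = {}, parents, tag) {
  return Object.entries(spec).some(
    ([parentRe, childRes]) =>
      parents.some((p) => reTest(parentRe, p)) &&
      toArray(childRes).some((re) => reTest(re, tag))
  );
}

// `ancestors` lists tag names from the direct parent outward,
// e.g. ['DIV', 'BODY'].
function decide(tag, ancestors, opts) {
  const direct = ancestors.slice(0, 1);
  const hit = (kind) =>
    specMatches(opts[`${kind}_tags_direct`], direct, tag) ||
    specMatches(opts[`${kind}_tags_deep`], ancestors, tag);
  if (hit('remove')) return 'remove';
  if (hit('flatten')) return 'flatten';
  if (hit('allow')) return 'keep';
  return 'flatten'; // nothing matched
}

const opts = { allow_tags_direct: { body: 'div', div: '^i' } };
console.log(decide('DIV', ['BODY'], opts));       // keep
console.log(decide('I', ['DIV', 'BODY'], opts));  // keep
console.log(decide('EM', ['DIV', 'BODY'], opts)); // flatten
console.log(decide('I', ['BODY'], opts));         // flatten
```

The four calls mirror the allow_tags_direct usage example: DIV under BODY and I under DIV are kept, everything else falls through to flattening.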

Typedefs

DomDocument : Object

Implements the WHATWG DOM Document interface.

In the browser, this is window.document. In Node.js, this may for example be new JSDOM().window.document.

DomNode : Object

Implements the WHATWG DOM Node interface.

Custom properties for each node can be stored in a WeakMap passed as option nodePropertyMap to one of the sanitize functions.

Tagname : string

Node tag name.

Even though in the WHATWG DOM text nodes (nodeType 3) have the node name #text, these are referred to by the simpler string 'TEXT' for convenience.

Regex : string

A string which is compiled to a case-insensitive regular expression new RegExp(regex, 'i'). The regular expression is used to match a Tagname.

ParentChildSpec : Object.<Regex, Array.<Regex>>

Property names are matched against a (direct or ancestral) parent node's Tagname. Associated values are matched against the current node's Tagname.

TagAttributeNameSpec : Object.<Regex, Array.<Regex>>

Property names are matched against the current node's Tagname. Associated values are used to match its attribute names.

TagClassNameSpec : Object.<Regex, Array.<Regex>>

Property names are matched against the current node's Tagname. Associated values are used to match its class names.

FilterSpec : Object.<Regex, Array.<filter>>

Property names are matched against node Tagnames. Associated values are the filters which are run on the node.

filter ⇒ DomNode | Array.<DomNode> | null

Filter functions can either...

return the same node (the first argument),
return a single, or an Array of, newly created DomNode(s), in which case node is replaced with the new node(s),
return null, in which case node is removed.

Note that newly generated DomNode(s) are processed by running sanitizeDom on them, as if they had been part of the original tree. This has the following implication:

If a filter returns a newly generated DomNode with the same Tagname as node, the same filter would be called again, which may lead to an infinite loop if the filter always returns the same result (this would be a badly behaved filter). To protect against infinite loops, the author of the filter must acknowledge this circumstance by setting a boolean property called 'skip_filters' for the DomNode (in a WeakMap which the caller must provide to one of the sanitize functions as the argument nodePropertyMap). If 'skip_filters' is not set, an error is thrown. With well-behaved filters, it is possible to continue processing the returned node without causing an infinite loop.
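
A filter that returns a node with the same Tagname might set the flag as in the sketch below. `makeRebuildBold` and `stubDoc` are hypothetical names; the stub document exists only so the sketch runs without a DOM, and the shape of the stored object ({ skip_filters: true }) follows the nodePropertyMap description in this reference.

```javascript
// Map that the caller would pass as the nodePropertyMap argument.
const nodePropertyMap = new WeakMap();

// Hypothetical filter factory: the filter replaces a B node with a fresh
// B node that keeps only the text. `doc` is anything offering createElement.
function makeRebuildBold(doc) {
  return function rebuildBold(node) {
    const fresh = doc.createElement('b');
    fresh.textContent = node.textContent;
    // Same Tagname as the input, so flag the new node to keep the filter
    // from being run on it again.
    nodePropertyMap.set(fresh, { skip_filters: true });
    return fresh;
  };
}

// Stub document so the sketch runs without a DOM (demo-only assumption).
const stubDoc = {
  createElement: (tag) => ({ nodeName: tag.toUpperCase(), textContent: '' }),
};
const result = makeRebuildBold(stubDoc)({ nodeName: 'B', textContent: 'def' });
console.log(nodePropertyMap.get(result)); // { skip_filters: true }
```

With a real document, the filter would be registered under opts.filters_by_tag and the same map passed as the last argument, for example sanitizeNode(document, root, { filters_by_tag: { B: [makeRebuildBold(document)] } }, nodePropertyMap).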

sanitizeNode(doc, node, [opts], [nodePropertyMap])

Simple wrapper for sanitizeDom. Processes the node and its childNodes recursively.

Kind: global function

Param	Type	Default	Description
doc	`DomDocument`
node	`DomNode`
[opts]	`Object`	`{}`
[nodePropertyMap]	`WeakMap.<DomNode, Object>`	`new WeakMap()`	Additional node properties

sanitizeChildNodes(doc, node, [opts], [nodePropertyMap])

Simple wrapper for sanitizeDom. Processes only the node's childNodes recursively, but not the node itself.

Kind: global function

Param	Type	Default	Description
doc	`DomDocument`
node	`DomNode`
[opts]	`Object`	`{}`
[nodePropertyMap]	`WeakMap.<DomNode, Object>`	`new WeakMap()`	Additional node properties

sanitizeHtml(doc, html, [opts], [isDocument], [nodePropertyMap]) ⇒ `String`

Simple wrapper for sanitizeDom. Instead of a DomNode, it takes an HTML string.

Kind: global function
Returns: String - The processed HTML

Param	Type	Default	Description
doc	`DomDocument`
html	`string`
[opts]	`Object`	`{}`
[isDocument]	`Boolean`	`false`	Set this to `true` if you are passing an entire HTML document (beginning with the `<html>` tag). The context node name will be HTML. If `false`, then the context node name will be BODY.
[nodePropertyMap]	`WeakMap.<DomNode, Object>`	`new WeakMap()`	Additional node properties

sanitizeDom(doc, contextNode, [opts], [childrenOnly], [nodePropertyMap])

Kind: global function

Param	Type	Default	Description
doc	`DomDocument`		The document
contextNode	`DomNode`		The root node
[opts]	`Object`	`{}`	Options for processing.
[opts.filters_by_tag]	`FilterSpec`	`{}`	Matching filters are called with the node.
[opts.remove_tags_direct]	`ParentChildSpec`	`{}`	Matching nodes which are a direct child of the matching parent node are removed.
[opts.remove_tags_deep]	`ParentChildSpec`	`{'.*': ['style','script','textarea','noscript']}`	Matching nodes which are anywhere below the matching parent node are removed.
[opts.flatten_tags_direct]	`ParentChildSpec`	`{}`	Matching nodes which are a direct child of the matching parent node are flattened.
[opts.flatten_tags_deep]	`ParentChildSpec`	`{}`	Matching nodes which are anywhere below the matching parent node are flattened.
[opts.allow_tags_direct]	`ParentChildSpec`	`{}`	Matching nodes which are a direct child of the matching parent node are kept.
[opts.allow_tags_deep]	`ParentChildSpec`	`{}`	Matching nodes which are anywhere below the matching parent node are kept.
[opts.allow_attributes_by_tag]	`TagAttributeNameSpec`	`{}`	Matching attribute names of a matching node are kept. Other attributes are removed.
[opts.allow_classes_by_tag]	`TagClassNameSpec`	`{}`	Matching class names of a matching node are kept. Other class names are removed. If no class names are remaining, the class attribute is removed.
[opts.remove_empty]	`boolean`	`false`	Remove nodes which are completely empty
[opts.join_siblings]	`Array.<Tagname>`	`[]`	Join same-tag sibling nodes of given tag names, unless they are separated by non-whitespace textNodes.
[childrenOnly]	`Bool`	`false`	If false, then the node itself and its descendants are processed recursively. If true, then only the children and their descendants are processed recursively, but not the node itself (use when `contextNode` is `BODY` or a `DocumentFragment`).
[nodePropertyMap]	`WeakMap.<DomNode, Object>`	`new WeakMap()`	Additional properties for a DomNode can be stored in an object and will be looked up in this map. The properties of the object and their meaning: `skip`: If truthy, disables all processing for this node. `skip_filters`: If truthy, disables all filters for this node. `skip_classes`: If truthy, disables processing classes of this node. `skip_attributes`: If truthy, disables processing attributes of this node. See tests for usage details.

DomDocument : `Object`

Implements the WHATWG DOM Document interface.

In the browser, this is window.document. In Node.js, this may for example be new JSDOM().window.document.

Kind: global typedef
See: https://dom.spec.whatwg.org/#interface-document

DomNode : `Object`

Implements the WHATWG DOM Node interface.

Custom properties for each node can be stored in a WeakMap passed as option nodePropertyMap to one of the sanitize functions.

Kind: global typedef
See: https://dom.spec.whatwg.org/#interface-node

Tagname : `string`

Node tag name.

Even though in the WHATWG DOM text nodes (nodeType 3) have the node name #text, these are referred to by the simpler string 'TEXT' for convenience.