html-parser
Now with less explosions!
The purpose of this library is not to be the best XML parsing library ever conceived. Because it's not. It's meant to be an HTML/XML parser that doesn't require valid HTML/XML. It's also meant to act as a sanitizer, which is the main reason for its existence.
For example, you can just shove a blob of text into it, and it will happily parse as if it were valid XML.
Licensed under MIT.
Installation
npm install html-parser
Callback based parsing
var htmlParser = ;

var html = '<!doctype html><html><body onload="alert(\'hello\');">Hello<br />world</body></html>';
htmlParser;

/*
doctype: html
open: html
close token: >
open: body
attribute: onload=alert('hello');
close token: >
text: Hello
open: br
close token: />, unary: true
text: world
close: body
close: html
*/
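The output listing above suggests one handler per token type. A minimal sketch of such a callbacks object follows. The handler names (`docType`, `openElement`, `closeOpenedElement`, `attribute`, `text`, `closeElement`) and their argument order are assumptions inferred from the labels in the listing, and `makeLoggingCallbacks` is a hypothetical helper, so check both against the library's documentation before relying on them.

```javascript
// Hypothetical helper: builds a callbacks object whose handlers append
// one line per token to `lines`, in the format of the listing above.
// Handler names and argument order are assumptions, not confirmed API.
function makeLoggingCallbacks(lines) {
  return {
    docType: function (value) { lines.push('doctype: ' + value); },
    openElement: function (name) { lines.push('open: ' + name); },
    closeOpenedElement: function (name, token, unary) {
      lines.push('close token: ' + token + (unary ? ', unary: true' : ''));
    },
    attribute: function (name, value) {
      lines.push('attribute: ' + name + '=' + value);
    },
    text: function (value) { lines.push('text: ' + value); },
    closeElement: function (name) { lines.push('close: ' + name); }
  };
}

// Drive the handlers by hand for `<br />` followed by text, to show the format.
var lines = [];
var callbacks = makeLoggingCallbacks(lines);
callbacks.openElement('br');
callbacks.closeOpenedElement('br', '/>', true);
callbacks.text('world');
console.log(lines.join('\n'));
// open: br
// close token: />, unary: true
// text: world
```

In real use the object would be handed to the parser together with the HTML string rather than being called by hand. Collecting lines in an array instead of logging directly makes the handlers easy to test.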
Sanitization
var htmlParser = ;

var html = '<script>alert(\'danger!\')</script><p onclick="alert(\'danger!\')">blah blah<!-- useless comment --></p>';
var sanitized = htmlParser;

console;
//<p>blah blah</p>
Using callbacks
var htmlParser = ;

var html = '<script>alert(\'danger!\')</script><p onclick="alert(\'danger!\')">blah blah<!-- useless comment --></p>';
var sanitized = htmlParser;

console;
//<p>blah blah</p>
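When the removal rules are functions rather than fixed lists, each one can be read as a predicate that decides whether an item is dropped. The sketch below shows predicates that would strip the `<script>` element and the `onclick` handler from the example. The option names (`elements`, `attributes`, `comments`), the argument order, and the "return true to remove" convention are assumptions for illustration, so verify them against the library's documentation.

```javascript
// Removal rules as predicates; returning true is assumed to mean
// "remove this item". Option names and argument order are assumptions.
var removalCallbacks = {
  // Drop <script> elements regardless of case.
  elements: function (name) {
    return name.toLowerCase() === 'script';
  },
  // Drop inline event handlers (onclick, onload, ...) and javascript: URLs.
  attributes: function (name, value) {
    return /^on/i.test(name) || /^\s*javascript:/i.test(String(value));
  },
  // Drop every comment.
  comments: true
};

console.log(removalCallbacks.elements('SCRIPT'));                        // true
console.log(removalCallbacks.elements('p'));                             // false
console.log(removalCallbacks.attributes('onclick', "alert('danger!')")); // true
console.log(removalCallbacks.attributes('href', '/about'));              // false
```

Matching on the `on` prefix rather than listing individual handler names covers handlers such as `onload` and `onerror` without enumerating them.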
Custom data elements
You can parse custom data elements, such as PHP code or Underscore templates, with regex.dataElements.
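As a rough illustration of the idea, a data element can be described by the delimiter that opens it and the delimiter that closes it, with everything in between treated as opaque data. The sketch below is not the library's implementation: the property names `start` and `end`, the callback name `phpCode`, and the helper `readDataElement` are all hypothetical and only show how such a config could delimit a block of PHP.

```javascript
// Hypothetical shape of a data element config; `start` and `end` are
// assumed property names, and `phpCode` is an illustrative callback name.
var regexConfig = {
  dataElements: {
    phpCode: { start: '<?php', end: '?>' }
  }
};

// Illustrative matcher (not the library's implementation): if `input`
// begins with cfg.start, return the content up to cfg.end, else null.
function readDataElement(input, cfg) {
  if (input.indexOf(cfg.start) !== 0) { return null; }
  var endIndex = input.indexOf(cfg.end, cfg.start.length);
  if (endIndex === -1) { return null; }
  return input.slice(cfg.start.length, endIndex);
}

var phpConfig = regexConfig.dataElements.phpCode;
console.log(JSON.stringify(readDataElement('<?php echo 1; ?><p>hi</p>', phpConfig)));
// " echo 1; "
console.log(readDataElement('<p>hi</p>', phpConfig));
// null
```

Keying the config by callback name mirrors the `Object.<callbackName,DataElementConfig>` type given for `regex.dataElements`, so each custom element can report to its own handler.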
config
helpers;
API
/**
 * Parses the given string o' HTML, executing each callback when it
 * encounters a token.
 *
 * @param
 * @param
 * @param
 * @param
 * @param
 * close it (">", "/>", "?>")
 * @param
 * @param
 * @param
 * @param
 * @param
 * @param
 * @param
 * @param
 * @param {Object.<callbackName,DataElementConfig>} [regex.dataElements] Config of data elements like docType, comment and your own custom data elements
 */

/**
 * @typedef
 * @property
 * @property
 * @property
 */

/**
 * Parses the HTML contained in the given file asynchronously.
 *
 * Note that this is merely a convenience function, it will still read the entire
 * contents of the file into memory.
 *
 * @param
 * @param
 * @param
 * @param
 */

/**
 * Sanitizes an HTML string.
 *
 * If removalCallbacks is not given, it will simply reformat the HTML
 * (i.e. converting all tags to lowercase, etc.). Note that this function
 * assumes that the HTML is decently formatted and kind of valid. It
 * may exhibit undefined or unexpected behavior if your HTML is trash.
 *
 * @param
 * @param
 * @param
 * @param
 * @param
 * @param
 * @return
 */
Development
git clone https://github.com/tmont/html-parser.git
cd html-parser
npm link
npm test