htmlclean

Simple and lightweight cleaner that just removes whitespaces, comments, etc. to minify HTML/SVG. This differs from others in that this removes whitespaces, line-breaks, etc. as much as possible.

htmlclean

Simple and lightweight cleaner that just removes whitespaces, comments, etc. to minify HTML/SVG.
This differs from others in that this removes whitespaces, line-breaks, etc. as much as possible.

htmlclean removes the following texts.

  • The leading whitespaces, tabs and line-breaks, and the trailing whitespaces, tabs and line-breaks.
  • The unneeded whitespaces, tabs and line-breaks between HTML/SVG tags.
  • The more than two whitespaces, tabs and line-breaks (suppressed to one space).
  • HTML/SVG comments.

For example: The more than two whitespaces (even if those are divided by HTML/SVG tags) in a line are suppressed.

  • Before
<p>The <strong> clean <span> <em> HTML is here. </em> </span> </strong> </p>
  • After
<p>The <strong>clean <span><em>HTML is here.</em></span></strong></p>

The whitespace that was right side of <strong> was removed, and the left side was kept.
The both side whitespaces of <em> were removed.

The following texts are protected (excluded from Removing).

  • The texts in textarea, script and style elements, and the text nodes in pre elements.
  • The quoted texts in the tag attributes.
  • The texts in the SSI tags (PHP, JSP, ASP/ASP.NET and Apache SSI).
  • IE conditional comments. e.g. <!--[if lt IE 7]>
  • The texts between <!--[htmlclean-protect]--> and <!--[/htmlclean-protect]-->.
  • The texts that is matched by the protect option.
npm install -g htmlclean
htmlclean [options] [input1 [input2 ...]]

Command line tool needs -g option when install package.
See htmlclean -h for usage.

  • Clean index.html, and write to index.min.html.
htmlclean index.html
  • Clean index.html, and overwrite it.
htmlclean index.html -o index.html
  • Clean all HTML files in src directory, and write into public directory.
htmlclean src -o public
  • Clean all SVG files.
htmlclean *.svg
  • Get and clean web page on URL, and write to index.html.
wget -q -O - https://github.com/ | htmlclean -o index.html
  • Clean and compress index.html, and write to index.gz.
htmlclean index.html -o - | gzip > index.gz
  • Clean 3 files, and write into 1 file.
htmlclean -i head.html -i body.html -i foot.html \
-o index.html -o index.html -o index.html

In the GUI environment, drag-and-drop the target file or directory or multiple items to the htmlclean icon. Or the short cut (alias, link, etc.) icon on the desktop also works.

The htmlclean icon is found in:

npm bin -g
cleanHtml = htmlclean(sourceHtml[, options])

require('htmlclean') returns a Function. This Function accepts a source HTML, and returns a clean HTML. If you want, you can specify an options Object to second argument (see Options).

var htmlclean = require('htmlclean');
html = htmlclean(html);
 
// Or 
html = require('htmlclean')(html);

You can specify an options Object to second argument. This Object can have following properties.

Type: RegExp or Array

The texts which are matched to this RegExp are protected in addition to above Protecting list. The multiple RegExps can be specified via an Array.

Type: RegExp or Array

The texts which are matched to this RegExp are cleaned even if that text is included in above Protecting list. The multiple RegExps can be specified via an Array.
For example, a HTML as template in <script type="text/x-handlebars-template"> is cleaned via following.

html = htmlclean(html, {
  unprotect: /<script [^>]*\btype="text\/x-handlebars-template"[\s\S]+?<\/script>/ig
});

The x-handlebars-template in a type attribute above is case of using the Template Framework Handlebars. e.g. AngularJS requires ng-template instead of it.

NOTE: The RegExp has to match to a text which is not a part of the protected text. For example, the RegExp matches color: red; in <style> element, but this is not cleaned because all texts in the <style> element are protected. color: red; is a part of the protected text. The RegExp has to match to a text which is all of <style> element like /<style[\s\S]+?<\/style>/.

Type: Function

This Function more edits a HTML.
The protected texts are hidden from a HTML, and a HTML is passed to this Function. Therefore, this Function doesn't break the protected texts. A HTML which returned from this Function is restored.
NOTE: The markers \fID\f (\f is "form feed" \x0C code, ID is number) are inserted to a HTML instead of the protected texts. This Function can remove these markers, but can't add new markers. (Invalid markers will be just removed.)

See the source HTML file and the result HTML files in the sample directory.

var htmlclean = require('htmlclean'),
  fs = require('fs'),
  htmlBefore = fs.readFileSync('./before.html', {encoding: 'utf8'});
 
var htmlAfter1 = htmlclean(htmlBefore);
fs.writeFileSync('./after1.html', htmlAfter1);
 
var htmlAfter2 = htmlclean(htmlBefore, {
  protect: /<\!--%fooTemplate\b.*?%-->/g,
  unprotect: /<script [^>]*\btype="text\/x-handlebars-template"[\s\S]+?<\/script>/ig,
  editfunction(html) { return html.replace(/\begg(s?)\b/ig, 'omelet$1'); }
});
fs.writeFileSync('./after2.html', htmlAfter2);

htmlclean can't parse the malformed nested tags like <p>foo<pre>bar</p>baz</pre> precisely. And the close tags in script like <script>var foo = '</script>';</script> too. Or, ?> in PHP code, etc.
Some language parsers also mistake, then those recommend us to write code like '<' + '/script>'. This is better even if htmlclean is not used.

htmlclean removes the HTML/SVG comments that include the SSI tag like <!-- Info for admin - Foo:<?= expression ?> -->. I think it's no problem because htmlclean is used to minify HTML. If that SSI tag includes the important code for logic, use a protect option, or <!--[htmlclean-protect]--> and <!--[/htmlclean-protect]-->.

If you want to control details of editing, HtmlCompressor, HTMLMinifier and others are better choice.