A fast C++ HTML scanner that can also parse badly formed HTML
HTMLScanner is a fast HTML/XML scanner/tokenizer for Node.js. The scanner tries to be forgiving and is ideal for messy HTML documents. It should parse most HTML files and, of course, valid XML files as well.
Please note there is no explicit support for namespaces. If you need a full-blown XML parser, there are already many good alternatives available for Node.js.
The core of the scanner is a fast C++ module that is about 80% based on the excellent XHScanner created by Andrew Fedoniouk, see also [http://www.codeproject.com/KB/recipes/HTML_XML_Scanner.aspx]. Without XHScanner, HTMLScanner would not be here today.
Just run the npm install command:
$ npm install htmlscanner
Or, if you would rather build it yourself:
$ git clone firstname.lastname@example.org:jbaron/htmlscanner.git
$ cd htmlscanner
$ node-waf configure build install
You should now have a file called htmlscanner.node in the lib directory. We use node-waf to build this module. Please note that older versions of node-waf use a different build directory; in that case you should find the file somewhere under the build/default directory. Some simple test cases are also included with this module. For example, run:
$ node test/test_simple.js
The usage is straightforward:
    var Scanner = require("../lib/htmlscanner").Scanner;
    var scanner = new Scanner("<div id=12 class=important>hello</div>");
    var token;
    do {
        token = scanner.next();
        console.dir(token);
    } while (token[0]);  // type 0 (END OF FILE) ends the loop
The token you get back from the scanner.next() call contains all the info. The above sample would produce the following output:
    [ 1, "div", "id", "12", "class", "important" ]  // Type 1 indicates OPEN TAG. Attribute key/value pairs are also included.
    [ 4, "hello" ]                                  // Type 4 indicates TEXT
    [ 2, "div" ]                                    // Type 2 indicates CLOSE TAG
    [ 0 ]                                           // Type 0 indicates END OF FILE
The first element in the array is the token type; the remaining elements depend on that type.
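Because each token is a flat array, it can be handy to convert it into a named object before further processing. The following is a minimal sketch, not part of the module: `describeToken` and `TOKEN_NAMES` are hypothetical names, and the type codes (0, 1, 2, 4) are assumed from the sample output only, so other codes may exist. It runs on hard-coded sample tokens and does not need the native module.

```javascript
// Token type codes as seen in the sample output (an assumption, not an official list).
var TOKEN_NAMES = { 0: "eof", 1: "open", 2: "close", 4: "text" };

// Hypothetical helper: turn a raw token array into a plain object.
function describeToken(token) {
  var type = token[0];
  var result = { type: TOKEN_NAMES[type] || "unknown(" + type + ")" };
  if (type === 1) {
    result.name = token[1];
    result.attributes = {};
    // Attribute keys and values alternate after the tag name.
    for (var i = 2; i + 1 < token.length; i += 2) {
      result.attributes[token[i]] = token[i + 1];
    }
  } else if (type === 2) {
    result.name = token[1];
  } else if (type === 4) {
    result.text = token[1];
  }
  return result;
}

// Tokens copied from the sample output, so no scanner instance is required here.
var sample = [
  [1, "div", "id", "12", "class", "important"],
  [4, "hello"],
  [2, "div"],
  [0]
];
sample.forEach(function (token) {
  console.dir(describeToken(token));
});
```

In real use, the same helper could be called on each value returned by scanner.next() inside the loop.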
There are several things still to do:
- Entity decoding of text. Although much of the code is already there, it is not yet Unicode ready.
- Add routines for entity encoding.
- Add support for Buffers. Right now only Strings are supported.
- Add some additional robustness checks.
- Compile on platforms other than Linux. The code should be portable, but it has only been tested on Linux. If you succeed in compiling and using it on OS X or Windows, please let us know.
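Until Buffers are supported, one possible workaround is to decode a Buffer into a String before handing it to the scanner. This is only a sketch: `toScannerInput` is a hypothetical helper, not part of the module, and it assumes the data is UTF-8 unless another encoding is given.

```javascript
// Hypothetical helper: accept a String or a Buffer and return a String.
function toScannerInput(input, encoding) {
  if (Buffer.isBuffer(input)) {
    // Decode the raw bytes; UTF-8 is an assumed default.
    return input.toString(encoding || "utf8");
  }
  return String(input);
}

// Example: data read from a file or socket usually arrives as a Buffer.
var html = toScannerInput(Buffer.from("<b>hi</b>", "utf8"));
console.log(html);
```

The resulting string can then be passed to the scanner in the same way as in the usage example.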