dirty-html-content-parser

nodejs-dirty-html-content-parser

Module for parsing content from dirty HTML.

It uses string diffs to extract content fragments from HTML documents. First, register a reference HTML document together with string positions (start and end offsets) that mark each type of content. The module then uses this reference to find the same type of content in other HTML documents by brute-force searching for the candidate with the smallest diff.

Because the module relies only on string diffs, this method also works on dirty, invalid HTML.

To reduce the number of diffs to brute-force, every defined content fragment must lie between tags (see the result in the example code below). These can be any kind of tag: an opening tag, a closing tag, or both. TODO: This must be fixed for version 0.0.0.0.0.1
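The idea of brute-forcing over tag-bounded candidates can be illustrated with a minimal sketch. It is not the module's actual implementation: the function names (`editDistance`, `tagBoundaries`, `findFragment`) and the `contextSize` parameter are hypothetical, and scoring candidates by the edit distance of the text surrounding the fragment is an assumed stand-in for whatever diff the module computes.

```javascript
// Levenshtein edit distance between two strings (two-row dynamic programming).
function editDistance(a, b) {
	var prev = [];
	for (var j = 0; j <= b.length; j++) prev[j] = j;
	for (var i = 1; i <= a.length; i++) {
		var curr = [i];
		for (j = 1; j <= b.length; j++) {
			var subst = prev[j - 1] + (a[i - 1] === b[j - 1] ? 0 : 1);
			curr[j] = Math.min(prev[j] + 1, curr[j - 1] + 1, subst);
		}
		prev = curr;
	}
	return prev[b.length];
}

// Offsets where a fragment may start or end: at a '<' or right after a '>'.
// Restricting candidates to these offsets is what keeps the search small.
function tagBoundaries(html) {
	var bounds = [];
	for (var i = 0; i < html.length; i++) {
		if (html[i] === '<' && bounds[bounds.length - 1] !== i) bounds.push(i);
		if (html[i] === '>') bounds.push(i + 1);
	}
	return bounds;
}

// Assumed scoring: compare the text just before and just after the reference
// fragment with the text around each candidate, and keep the cheapest pair.
function findFragment(refHtml, start, end, html, contextSize) {
	var before = refHtml.slice(Math.max(0, start - contextSize), start);
	var after = refHtml.slice(end, end + contextSize);
	var bounds = tagBoundaries(html);
	var startCosts = bounds.map(function (pos) {
		return editDistance(before, html.slice(Math.max(0, pos - contextSize), pos));
	});
	var endCosts = bounds.map(function (pos) {
		return editDistance(after, html.slice(pos, pos + contextSize));
	});
	var best = null;
	for (var s = 0; s < bounds.length; s++) {
		for (var e = s + 1; e < bounds.length; e++) {
			var cost = startCosts[s] + endCosts[e];
			if (best === null || cost < best.cost) {
				best = { cost: cost, text: html.slice(bounds[s], bounds[e]) };
			}
		}
	}
	return best;
}

// Usage with small stand-in documents.
var ref = '<div><h1>Ref title</h1><p>by Ann</p></div>';
var other = '<div><h1>Other headline</h1><p>by Bob</p></div>';
var found = findFragment(ref, ref.indexOf('<h1>'), ref.indexOf('</h1>') + 5, other, 5);
console.log(found.text);
```

Because the score depends only on the surrounding text, the sketch never parses the HTML, which is why an approach of this kind tolerates invalid markup.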

You can define a validator function in the reference to increase the chances of a proper match.

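// Assumption: the package exports the Parser constructor directly.
var Parser = require('dirty-html-content-parser');

// referenceHtml and html are HTML strings that you load yourself.
var parser = new Parser();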
parser.reference('title', {
	html: referenceHtml,
	start: 33431,
	end: 33479,
	validator: function (data) {
		// Accept only fragments that begin with an <h1> tag.
		return data.indexOf('<h1>') === 0;
	}
});
parser.reference('author', {
	html: referenceHtml,
	start: 33482,
	end: 33533,
	validator
});
parser.parse(html, function (data) {
	console.dir(data);
	/*
		Example result:
		{
			title: '<h1>Example title</h1>',
			author: '<br />John Doe, Bagarmossen</div>'
		}
	*/
});
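
The `start` and `end` values above are hard-coded offsets into the reference document. As a minimal sketch, they could be derived from a known fragment rather than counted by hand. The helper `offsetsOf` is hypothetical and not part of the module, and it assumes that `end` is the offset just past the last character of the fragment, which the README does not state.

```javascript
// Hypothetical helper: find the offsets of a known fragment in the reference HTML.
// Assumption: `end` is exclusive (the offset just past the fragment).
function offsetsOf(html, fragment) {
	var start = html.indexOf(fragment);
	if (start === -1) {
		throw new Error('fragment not found in reference HTML');
	}
	if (html.indexOf(fragment, start + 1) !== -1) {
		throw new Error('fragment occurs more than once; offsets would be ambiguous');
	}
	return { start: start, end: start + fragment.length };
}

// Usage with a tiny stand-in for a real reference document.
var sampleHtml = '<div><h1>Example title</h1><br />John Doe, Bagarmossen</div>';
var titleOffsets = offsetsOf(sampleHtml, '<h1>Example title</h1>');
console.log(titleOffsets);
```

The resulting `start` and `end` could then be passed to `parser.reference` in place of the literal numbers, provided the exclusive-end assumption matches the module's behavior.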

    Install

    npm i dirty-html-content-parser

    Version

    0.0.10

    License

    GPLv3

    Collaborators

    • alfredgodoy