nodejs-dirty-html-content-parser

Module for parsing content from dirty HTML.

It uses diffs to extract content fragments from HTML documents. First, you register a reference HTML document with string position markers that define the different types of content. The module then uses this reference to find the same type of content in other HTML documents by brute-forcing for the smallest diff.

Because the module relies only on string diffs, the method also works on dirty, invalid HTML.
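The "smallest diff" idea can be illustrated with a minimal sketch. It assumes a plain Levenshtein edit distance as the diff measure and compares whole fragments; the module's actual diff strategy may differ, and the names editDistance and closestCandidate are hypothetical, not part of the module's API.

```javascript
// Assumption: Levenshtein distance stands in for the module's diff.
// Classic dynamic-programming edit distance between two strings.
function editDistance(a, b) {
	var prev = [];
	for (var j = 0; j <= b.length; j++) prev[j] = j;
	for (var i = 1; i <= a.length; i++) {
		var curr = [i];
		for (var k = 1; k <= b.length; k++) {
			var cost = a[i - 1] === b[k - 1] ? 0 : 1;
			curr[k] = Math.min(prev[k] + 1, curr[k - 1] + 1, prev[k - 1] + cost);
		}
		prev = curr;
	}
	return prev[b.length];
}

// Hypothetical helper: try every candidate and keep the one
// with the smallest distance to the reference fragment.
function closestCandidate(referenceFragment, candidates) {
	var best = null;
	var bestScore = Infinity;
	candidates.forEach(function (candidate) {
		var score = editDistance(referenceFragment, candidate);
		if (score < bestScore) {
			bestScore = score;
			best = candidate;
		}
	});
	return best;
}
```

Scoring every candidate is expensive, which is why the number of candidates matters.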

To reduce the number of diffs to brute-force, every defined content fragment must lie between tags (see the result in the example code below). These can be any kind of tag: an opening tag, a closing tag, or both. TODO: This must be fixed for version 0.0.0.0.0.1
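To see why the tag restriction shrinks the search space, here is a rough sketch that enumerates candidate fragments only at tag boundaries. It assumes, based on the example result, that a fragment begins at a "<" and ends just after a ">"; the function name candidateFragments is hypothetical and this is not the module's own code.

```javascript
// Hypothetical helper: list every substring that starts at a '<'
// and ends right after a '>'. No HTML parsing is involved, so
// malformed markup is handled like any other string.
function candidateFragments(html) {
	var starts = []; // indexes of '<'
	var ends = [];   // indexes just past '>'
	for (var i = 0; i < html.length; i++) {
		if (html[i] === '<') starts.push(i);
		if (html[i] === '>') ends.push(i + 1);
	}
	var fragments = [];
	starts.forEach(function (start) {
		ends.forEach(function (end) {
			// Skip pairs where the end comes before the start.
			if (end > start) fragments.push(html.slice(start, end));
		});
	});
	return fragments;
}
```

For a document with n tags this yields on the order of n² candidates, rather than one for every pair of character positions.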

You can define a validator function in the reference to increase the chances of a proper match.

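// Assumption: the package exports the Parser constructor directly.
var Parser = require('dirty-html-content-parser');

// referenceHtml and html are strings: the reference document and the document to parse.
var parser = new Parser();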
parser.reference('title', {
	html: referenceHtml,
	start: 33431,
	end: 33479,
	validator: function (data) {
		// Accept only fragments that start with an <h1> tag.
		return data.indexOf('<h1>') === 0;
	}
});
parser.reference('author', {
	html: referenceHtml,
	start: 33482,
	end: 33533,
	validator // shorthand property: assumes a function named validator is in scope
});
parser.parse(html, function (data) {
	console.dir(data);
	/*
		Example result:
		{
			title: '<h1>Example title</h1>',
			author: '<br />John Doe, Bagarmossen</div>'
		}
	*/
});
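
The start and end values above are string positions in the reference HTML. One way to obtain them is to search the reference for a fragment you already know. The sketch below is a hypothetical helper (markersFor is not part of the module) and assumes end is exclusive, as in String.prototype.slice; check the module's behavior before relying on that.

```javascript
// Hypothetical helper: find a known fragment in the reference HTML
// and return its string positions.
// Assumption: "end" is exclusive, as with String.prototype.slice.
function markersFor(referenceHtml, fragment) {
	var start = referenceHtml.indexOf(fragment);
	if (start === -1) throw new Error('Fragment not found in reference HTML');
	return { start: start, end: start + fragment.length };
}

var sampleReference = '<div><h1>Example title</h1><br />John Doe, Bagarmossen</div>';
var titleMarkers = markersFor(sampleReference, '<h1>Example title</h1>');
console.log(titleMarkers);
```

The returned positions can then be passed as start and end when registering a reference.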
