Node.js module for extracting web page content using Cheerio.
Turn any web page into a clean view. This module is based on luin's readability project.
npm install readability-js
read(html [, options], callback)
- html is a URL or HTML source code.
- options is an optional options object
- callback is the callback to run - callback(error, article, meta)
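var read = require('readability-js');
read('', function(err, article, meta) {
  if (err) throw err;
  // Main Article
  console.log(article.content.text());
  // Title
  console.log(article.title);
  // Article HTML Source Code
  console.log(article.content.html());
});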
NB: If the page declares a charset other than UTF-8, it is converted automatically. Charsets such as GBK and GB2312 are also supported.
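To illustrate what "marked with a charset" can mean in practice, here is a minimal sketch that reads the declared charset from a Content-Type header or a meta tag. It is not the code readability-js uses; the `detectCharset` name, the 1024-character scan window, and the UTF-8 fallback are all assumptions made for this example.

```javascript
// Hypothetical helper: find the charset a page declares, defaulting to utf-8.
function detectCharset(contentTypeHeader, html) {
  // Matches charset=GBK, charset="gb2312", charset='utf-8', and similar.
  const pattern = /charset\s*=\s*["']?([\w-]+)/i;

  // The HTTP header takes priority when it names a charset.
  const fromHeader = pattern.exec(contentTypeHeader || '');
  if (fromHeader) return fromHeader[1].toLowerCase();

  // Otherwise look for a meta declaration near the top of the document.
  const head = (html || '').slice(0, 1024);
  const fromMeta = pattern.exec(head);
  if (fromMeta) return fromMeta[1].toLowerCase();

  return 'utf-8';
}
```

A page served as `text/html; charset=GBK` would be reported as `gbk` by this sketch, which is the kind of page the note above says is converted automatically.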
readability-js passes the options directly to the request library. See the request documentation for all available options.
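As a minimal sketch of passing request options through, the object below sets a timeout and a custom header. `timeout` and `headers` are options of the request library; the `fetchOptions` name and the specific values are illustrative assumptions.

```javascript
// Options forwarded to the request library (values are illustrative).
const fetchOptions = {
  // request option: milliseconds to wait before aborting the request
  timeout: 5000,
  // request option: custom HTTP headers sent with the request
  headers: { 'User-Agent': 'my-reader/1.0' }
};

// Hypothetical usage, assuming `read` is the function exported by readability-js
// and `url` holds the address of the page to fetch:
// read(url, fetchOptions, function (err, article, meta) { /* ... */ });
```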
readability-js has two additional options:
onlyArticleBody (Boolean) - whether to get only the article body or all main content;
preprocess (Function) - a function that checks or modifies the downloaded source before it is passed to readability.
read(url, {
  preprocess: function(source, response, contentType, callback) {
    if (source.length > maxBodySize) {
      return callback(new Error('too big'));
    }
    callback(null, source);
  }
}, function(err, article, response) {
  // ...
});
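A preprocess function can also modify the source instead of rejecting it. The sketch below removes script blocks before the source reaches readability. `stripScripts` is a hypothetical name, and the code assumes `source` can be treated as a string; it follows the `callback(error, source)` convention used in the example above.

```javascript
// Hypothetical preprocess function: drop <script> blocks from the source.
function stripScripts(source, response, contentType, callback) {
  // Assumption: source is a string or converts cleanly to one.
  const cleaned = String(source).replace(
    /<script\b[^>]*>[\s\S]*?<\/script>/gi,
    ''
  );
  // First argument is the error (none here), second is the modified source.
  callback(null, cleaned);
}
```

It could then be supplied as `read(url, { preprocess: stripScripts }, callback)`.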
Article object
content - The article content of the web page, as a Cheerio object. It is false if extraction failed.
title - The article title of the web page. It may not be the same as the text in the <title> tag.
excerpt - The article description, taken from any of the description, og:description or twitter:description meta tags.
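The fields above can be consumed defensively, since content is false when extraction fails. The following is a minimal sketch; `summarize` is a hypothetical helper and not part of readability-js, and it relies only on the Cheerio `.text()` method.

```javascript
// Hypothetical helper: turn an article object into plain data.
function summarize(article) {
  // content is false when extraction failed; otherwise it is a Cheerio object.
  if (!article.content) {
    return { ok: false, title: article.title || '', excerpt: '', text: '' };
  }
  return {
    ok: true,
    title: article.title,
    excerpt: article.excerpt || '',
    // .text() returns the combined text of the selected elements.
    text: article.content.text().trim()
  };
}
```

Inside the read callback this could be called as `summarize(article)` before logging or storing the result.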