Find The Main Content In An HTML Page
A module that finds the main content of an HTML page with the help of Cheerio. It can return that content as Markdown, as plain text, or as HTML.
It removes the header, footer, menu, sidebar, ...
Installation
$ npm install find-main-content -S
You also need to install Cheerio:
$ npm install cheerio -S
Simple usage
const cheerio = require('cheerio');
const findContent = require('find-main-content');

// htmlString holds the HTML source of the page
const $ = cheerio.load(htmlString);

// Returns a data structure with the main content and
// some extracted info on links, images, headers, title, description, ...
const html = findContent(...); // get the main content in the html format
const txt = findContent(...); // get the main content in the txt format
const md = findContent(...); // get the main content in the markdown format
Options
Options control how the main div is extracted. You can specify any subset of the following attributes.
const options = {
  // If more than one H1 is found, use the first one as the main title of the page
  useFirstH1: true,

  // Remove the H1 from the main content; the H1 will still be in the final JSON structure
  removeH1FromContent: true,

  // Some sites put links in Hn tags; if true, remove them
  removeHeadersWithoutText: true,

  // If true, don't add the images to the final extraction
  removeImages: true,

  // Remove the HTML tag figcaption
  removeFigcaptions: true,

  // Replace links with their anchor text
  replaceLinks: true,

  // Remove HTML forms
  removeForm: false,

  // Remove basic HTML tags that have no children
  removeEmptyTag: false,

  // Remove tags that match the selectors
  removeTags: '...', // list of selectors separated by commas or line breaks

  // The HTML selector. If specified, the main content will be extracted from the HTML element that matches the selector
  htmlSelector: '...'
};

const cheerio = require('cheerio');
const findContent = require('find-main-content');

// htmlString holds the HTML source of the page
const $ = cheerio.load(htmlString);

// Call findContent with the options above
const data = findContent(...);
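As an illustration of what "a subset of the following attributes" can mean in practice, here is a minimal sketch of merging a partial options object over defaults, and of splitting a `removeTags` string (selectors separated by commas or line breaks) into individual selectors. The helpers `resolveOptions` and `parseSelectors` and the default values are hypothetical and not part of the module's API; the module's own handling may differ.

```javascript
// Hypothetical defaults, copied from the example values listed above
// (the module's real defaults may differ).
const exampleDefaults = {
  useFirstH1: true,
  removeH1FromContent: true,
  removeHeadersWithoutText: true,
  removeImages: true,
  removeFigcaptions: true,
  replaceLinks: true,
  removeForm: false,
  removeEmptyTag: false,
  removeTags: '',
  htmlSelector: ''
};

// Merge a user-supplied subset of options over the defaults.
function resolveOptions(userOptions = {}) {
  return { ...exampleDefaults, ...userOptions };
}

// Split a removeTags string on commas or line breaks
// into trimmed, non-empty selectors.
function parseSelectors(removeTags) {
  return String(removeTags || '')
    .split(/[,\r\n]+/)
    .map((selector) => selector.trim())
    .filter((selector) => selector.length > 0);
}

const resolved = resolveOptions({
  removeImages: false,
  removeTags: '.ads, .share\n#comments'
});

console.log(resolved.removeImages);               // false
console.log(resolved.useFirstH1);                 // true
console.log(parseSelectors(resolved.removeTags)); // [ '.ads', '.share', '#comments' ]
```

Only the keys passed to `resolveOptions` change; every other option keeps its default value.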
Structure returned by the function findContent
{
  title: '...',
  description: '...',
  images: [
    {
      src: 'https://... .jpg',
      alt: '...'
    },
    ...
  ],
  links: [
    {
      href: 'https://...',
      text: '...'
    },
  ],
  headers: [
    {
      type: 'h1',
      text: '...'
    },
    {
      type: 'h2',
      text: '...'
    },
    ...
  ],
  content: '....' // in either html, markdown or txt format
}
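To illustrate how the returned structure might be consumed, the sketch below builds a short summary from an object of that shape. The `summarize` helper and all sample values are hypothetical; they only mirror the fields listed above and are not part of the module.

```javascript
// Sample object shaped like the structure above
// (all values are made up for illustration).
const sample = {
  title: 'Example page',
  description: 'A short description',
  images: [{ src: 'https://example.com/a.jpg', alt: 'A' }],
  links: [
    { href: 'https://example.com/one', text: 'One' },
    { href: 'https://example.com/two', text: 'Two' }
  ],
  headers: [
    { type: 'h1', text: 'Example page' },
    { type: 'h2', text: 'First section' },
    { type: 'h2', text: 'Second section' }
  ],
  content: '# Example page\n\nSome text.'
};

// Count headers per level and collect the link targets.
function summarize(data) {
  const headerCounts = {};
  for (const header of data.headers || []) {
    headerCounts[header.type] = (headerCounts[header.type] || 0) + 1;
  }
  return {
    title: data.title,
    imageCount: (data.images || []).length,
    linkTargets: (data.links || []).map((link) => link.href),
    headerCounts
  };
}

const summary = summarize(sample);
console.log(summary.headerCounts); // { h1: 1, h2: 2 }
console.log(summary.linkTargets);  // [ 'https://example.com/one', 'https://example.com/two' ]
```

The `|| []` fallbacks are a defensive choice so the helper also works if a field is missing from the object.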