    article-parser

    Extract the main article, main image, and metadata from a URL.


    Demo

    Installation

    $ npm install article-parser
    
    # pnpm
    $ pnpm install article-parser
    
    # yarn
    $ yarn add article-parser

    Usage

    // commonjs syntax
    const { extract } = require('article-parser')
    
    // es6 module syntax
    import { extract } from 'article-parser'
    
    // test
    const url = 'https://dev.to/ndaidong/how-to-make-your-mongodb-container-more-secure-1646'
    
    extract(url).then((article) => {
      console.log(article)
    }).catch((err) => {
      console.trace(err)
    })

    Result:

    {
      url: 'https://dev.to/ndaidong/how-to-make-your-mongodb-container-more-secure-1646',
      title: 'How to make your MongoDB container more secure?',
      description: 'Start it with docker   The most simple way to get MongoDB instance in your machine is using...',
      links: [
        'https://dev.to/ndaidong/how-to-make-your-mongodb-container-more-secure-1646'
      ],
      image: 'https://res.cloudinary.com/practicaldev/image/fetch/s--qByI1v3K--/c_imagga_scale,f_auto,fl_progressive,h_500,q_auto,w_1000/https://dev-to-uploads.s3.amazonaws.com/i/p4sfysev3s1jhw2ar2bi.png',
      content: '...', // full article content here
      author: '@ndaidong',
      source: 'dev.to',
      published: '',
      ttr: 162
    }
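
    Note that ttr is reported in seconds. A minimal sketch for turning it into a human-friendly label (assuming `article` is the result object above):

    // ttr is in seconds; round up to whole minutes for a "min read" label
    const minutes = Math.ceil(article.ttr / 60) // 162 -> 3
    console.log(`${minutes} min read`)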

    APIs

    extract(String url | String html)

    Loads and extracts article data. Returns a Promise.

    Example:

    const { extract } = require('article-parser')
    
    const getArticle = async (url) => {
      try {
        const article = await extract(url)
        return article
      } catch (err) {
        console.trace(err)
        return null
      }
    }
    
    getArticle('https://domain.com/path/to/article')

    If the extraction works well, you should get an article object with the structure below:

    {
      "url": URI String,
      "title": String,
      "description": String,
      "image": URI String,
      "author": String,
      "content": HTML String,
      "published": Date String,
      "source": String, // original publisher
      "links": Array, // list of alternative links
      "ttr": Number, // time to read in second, 0 = unknown
    }
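
    Because extract() accepts either a URL or an HTML string, you can also run it on markup you have already fetched yourself. A minimal sketch (the HTML here is made up; note that content shorter than contentLengthThreshold may be rejected):

    const { extract } = require('article-parser')

    const html = `<html>
      <head><title>Example article</title></head>
      <body><article><p>Long enough body text goes here...</p></article></body>
    </html>`

    extract(html).then((article) => {
      console.log(article.title)
    }).catch((err) => {
      console.trace(err)
    })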

    addQueryRules(Array queryRules)

    Adds custom rules for extracting the main article from specific domains.

    This can be useful when the default extraction algorithm fails, or when you want to remove some parts of the main article content.

    Example:

    const { addQueryRules, extract } = require('article-parser')
    
    // suppose the default extractor doesn't work for this site
    extract('https://bad-website.domain/page/article')
    
    // add some rules for bad-website.domain
    addQueryRules([
      {
        patterns: [
          /http(s?):\/\/bad-website\.domain\/*/
        ],
        selector: '#noop_article_locates_here',
        unwanted: [
          '.advertise-area',
          '.stupid-banner'
        ]
      }
    ])
    
    // extractor will try to find article at `#noop_article_locates_here`
    
    // call it again, hopefully it works for you now :)
    extract('https://bad-website.domain/page/article')

    While adding rules, you can specify a transform() function to fine-tune article content more thoroughly.

    Example rule with transformation:

    const { addQueryRules } = require('article-parser')
    
    addQueryRules([
      {
        patterns: [
          /http(s?):\/\/bad-website\.domain\/*/
        ],
        selector: '#article_id_here',
        transform: ($) => {
          // $ is cheerio's DOM instance containing the article content,
          // so you can do everything cheerio supports
          // for example, here we replace all <h1></h1> with <b></b>
          $('h1').replaceWith(function () {
            const h1Html = $(this).html()
            return `<b>${h1Html}</b>`
          })
          // at the end, you must return $
          return $
        }
      }
    ])

    Please refer to cheerio's docs for more info.

    Configuration methods

    In addition, this lib provides some methods to customize the default settings. Don't touch them unless you have a reason to do so.

    • getParserOptions()
    • setParserOptions(Object parserOptions)
    • getRequestOptions()
    • setRequestOptions(Object requestOptions)
    • getSanitizeHtmlOptions()
    • setSanitizeHtmlOptions(Object sanitizeHtmlOptions)

    Here are the default properties and values:

    Object parserOptions:

    {
      wordsPerMinute: 300, // to estimate "time to read"
      urlsCompareAlgorithm: 'levenshtein', // to find the best url from the list
      descriptionLengthThreshold: 40, // min num of chars required for description
      descriptionTruncateLen: 156, // max num of chars generated for description
      contentLengthThreshold: 200 // content must have at least 200 chars
    }

    Read string-comparison docs for more info about urlsCompareAlgorithm.
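
    For example, to slow the reading-speed estimate down and demand longer descriptions, you could set something like this (a sketch; if you are unsure whether the setter merges partial objects with the defaults, pass the full object returned by getParserOptions()):

    const { setParserOptions } = require('article-parser')

    setParserOptions({
      wordsPerMinute: 250, // lower speed -> larger ttr estimates
      descriptionLengthThreshold: 60
    })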

    Object requestOptions:

    {
      headers: {
        'user-agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:95.0) Gecko/20100101 Firefox/95.0',
        accept: 'text/html; charset=utf-8'
      },
      responseType: 'text',
      responseEncoding: 'utf8',
      timeout: 6e4, // 60,000 ms = 1 minute
      maxRedirects: 3
    }

    Read axios' request config for more info.
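
    For instance, to identify your crawler and fail faster on slow hosts (a sketch; the user-agent string is only an example, and the same caveat about partial options applies):

    const { setRequestOptions } = require('article-parser')

    setRequestOptions({
      headers: {
        'user-agent': 'my-crawler/1.0 (+https://example.org/bot)'
      },
      timeout: 1e4, // 10 seconds instead of the default 60
      maxRedirects: 5
    })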

    Object sanitizeHtmlOptions:

    {
      allowedTags: [
        'h1', 'h2', 'h3', 'h4', 'h5',
        'u', 'b', 'i', 'em', 'strong', 'small', 'sup', 'sub',
        'div', 'span', 'p', 'article', 'blockquote', 'section',
        'details', 'summary',
        'pre', 'code',
        'ul', 'ol', 'li', 'dd', 'dl',
        'table', 'th', 'tr', 'td', 'thead', 'tbody', 'tfoot',
        'fieldset', 'legend',
        'figure', 'figcaption', 'img', 'picture',
        'video', 'audio', 'source',
        'iframe',
        'progress',
        'br', 'hr',
        'label',
        'abbr',
        'a',
        'svg'
      ],
      allowedAttributes: {
        a: ['href', 'target', 'title'],
        abbr: ['title'],
        progress: ['value', 'max'],
        img: ['src', 'srcset', 'alt', 'width', 'height', 'style', 'title'],
        picture: ['media', 'srcset'],
        video: ['controls', 'width', 'height', 'autoplay', 'muted'],
        audio: ['controls'],
        source: ['src', 'srcset', 'data-srcset', 'type', 'media', 'sizes'],
        iframe: ['src', 'frameborder', 'height', 'width', 'scrolling'],
        svg: ['width', 'height']
      },
      allowedIframeDomains: ['youtube.com', 'vimeo.com']
    }

    Read sanitize-html docs for more info.
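
    For example, to restrict the extracted content to a small set of tags (a sketch; these options are passed through to sanitize-html, so anything its docs allow should work here):

    const { setSanitizeHtmlOptions } = require('article-parser')

    setSanitizeHtmlOptions({
      allowedTags: ['h2', 'h3', 'p', 'ul', 'ol', 'li', 'a', 'img'],
      allowedAttributes: {
        a: ['href', 'title'],
        img: ['src', 'alt']
      }
    })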

    Test

    git clone https://github.com/ndaidong/article-parser.git
    cd article-parser
    npm install
    npm test
    
    # quick evaluation
    npm run eval {URL_TO_PARSE_ARTICLE}

    License

    The MIT License (MIT)

