    html-miner

    4.0.0 • Public • Published

    HTML Miner

    A powerful miner that will scrape html pages for you.

    Install

    # using npm
    npm i --save html-miner
    
    # using yarn
    yarn add html-miner

    Example

    I have collected common use cases in a dedicated EXAMPLE.md. Feel free to start from the Usage section or jump directly to the Example page.

    If you want to experiment, an online playground is also available.

    📗 Enjoy your reading

    Usage

    Arguments

    html-miner accepts two arguments: html and selector.

    const htmlMiner = require('html-miner');
    
    // htmlMiner(html, selector);

    HTML

    html is a string containing HTML code.

    let html = '<div class="title">Hello <span>Marco</span>!</div>';

    SELECTOR

    selector can be one of the following:

    STRING

    htmlMiner(html, '.title');
    //=> Hello Marco!

    If the selector matches more than one element, the result is an array:

    let htmlWithDivs = '<div>Element 1</div><div>Element 2</div>';
    htmlMiner(htmlWithDivs, 'div');
    //=> ['Element 1', 'Element 2']
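
    Because a string selector yields a plain string for one match and an array for several, calling code may want a uniform shape. The helper below is a minimal sketch in plain JavaScript: toArray is a hypothetical name, not part of html-miner, and treating a missing value as an empty list is an assumption made here for safety, not documented behaviour.

```javascript
// Hypothetical helper (not part of html-miner): normalise a selector
// result so callers can always iterate over it.
function toArray(value) {
    // Assumption: treat a missing value as "no matches".
    if (value === undefined || value === null) {
        return [];
    }
    return Array.isArray(value) ? value : [value];
}

// Values copied from the examples above.
console.log(toArray('Hello Marco!'));
console.log(toArray(['Element 1', 'Element 2']));
```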

    FUNCTION

    See the "Function in detail" paragraph.

    htmlMiner(html, () => 'Hello everyone!');
    //=> Hello everyone!
    
    htmlMiner(html, function () {
        return 'Hello everyone!'
    });
    //=> Hello everyone!

    ARRAY

    htmlMiner(html, ['.title', 'span']);
    //=> ['Hello Marco!', 'Marco']

    OBJECT

    htmlMiner(html, {
        title: '.title',
        who: 'span'
    });
    //=> {
    //     title: 'Hello Marco!',
    //     who: 'Marco'
    //   }

    You can combine arrays and objects with each other, or with strings and functions.

    htmlMiner(html, {
        title: '.title',
        who: '.title span',
        upper: (arg) => { return arg.scopeData.who.toUpperCase(); }
    });
    //=> {
    //     title: 'Hello Marco!',
    //     who: 'Marco',
    //     upper: 'MARCO'
    //   }

    Function in detail

    A function accepts only one argument that is an object containing:

    • $: a jQuery-like function bound to the document (the html argument). Use it to query and fetch elements from the HTML.

      htmlMiner(html, arg => arg.$('.title').text());
      //=> Hello Marco!
    • $scope: useful when combined with _each_ or _container_ (see the "Special keys" paragraph).

      htmlMiner(html, {
          title: '.title',
          spanList: {
              _each_: 'span',
              value: (arg) => {
                  // arg.$scope is the matched 'span', so "arg.$scope.find('.title')" finds nothing.
                  return arg.$scope.text();
              }
          }
      });
      //=> {
      //     title: 'Hello Marco!',
      //     spanList: [{
      //         value: 'Marco'
      //     }]
      //   }
    • globalData: an object that contains all previously fetched data.

      htmlMiner(html, {
          title: '.title',
          spanList: {
              _each_: '.title span',
              pageTitle: function(arg) {
                  // "arg.globalData.who" is undefined here because "who" is defined later.
                  return arg.globalData.title;
              }
          },
          who: '.title span'
      });
      //=> {
      //     title: 'Hello Marco!',
      //     spanList: [{
      //         pageTitle: 'Hello Marco!'
      //     }],
      //     who: 'Marco'
      //   }
    • scopeData: similar to globalData, but it only contains data from the current scope. Useful when combined with _each_ (see the "Special keys" paragraph).

      htmlMiner(html, {
          title: '.title',
          upper: (arg) => { return arg.scopeData.title.toUpperCase(); },
          sublist: {
              who: '.title span',
              upper: (arg) => {
                  // "arg.scopeData.title" is undefined because "title" is out of scope.
                  return arg.scopeData.who.toUpperCase();
              },
          }
      });
      //=> {
      //     title: 'Hello Marco!',
      //     upper: 'HELLO MARCO!',
      //     sublist: {
      //         who: 'Marco',
      //         upper: 'MARCO'
      //     }
      //   }
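
    Since a function selector only ever receives that single argument object, it can be exercised in isolation by passing a hand-built stand-in. The sketch below does this without loading html-miner. The names upperWho, pageTitle and stubArg are illustrative, and the stub only mimics the scopeData and globalData fields described above.

```javascript
// Function selectors written as standalone functions.
const upperWho = (arg) => arg.scopeData.who.toUpperCase();
const pageTitle = (arg) => arg.globalData.title;

// Hand-built stand-in for the argument object (illustrative only).
const stubArg = {
    globalData: { title: 'Hello Marco!' },
    scopeData: { who: 'Marco' }
};

console.log(upperWho(stubArg));  //=> MARCO
console.log(pageTitle(stubArg)); //=> Hello Marco!
```

    Keeping selector functions as named constants like this also makes them easy to reuse across several selector objects.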

    Special keys

    When selector is an object, you can use special keys:

    • _each_: creates a list of items. HTML Miner iterates over the elements matched by the value and parses the sibling keys for each one.

      {
          articles: {
              _each_: '.articles .article',
              title: 'h2',
              content: 'p',
          }
      }
    • _eachId_: useful when combined with _each_. Instead of an Array, it creates an Object whose keys are the values returned by the _eachId_ function.

      {
          articles: {
              _each_: '.articles .article',
              _eachId_: function(arg) {
                  return arg.$scope.data('id');
              },
              title: 'h2',
              content: 'p',
          }
      }
    • _container_: uses the parsed value as a container. HTML Miner parses the sibling keys, searching for them inside the container.

      {
          footer: {
              _container_: 'footer',
              copyright: (arg) => { return arg.$scope.text().trim(); },
              company: 'span' // find only 'span' inside 'footer'.
          }
      }
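
    To make the difference in result shape between _each_ and _eachId_ concrete, the following plain-JavaScript model builds both forms from the same list. It does not call html-miner and is not how the library is implemented: items, parseItem and idOf are hypothetical stand-ins for the matched elements, the sibling keys and the _eachId_ function.

```javascript
// Hypothetical stand-ins for the elements matched by _each_.
const items = [
    { id: 'a001', heading: 'Heading 1' },
    { id: 'a002', heading: 'Heading 2' }
];

// Stand-in for parsing the sibling keys of one element.
const parseItem = (item) => ({ title: item.heading });
// Stand-in for the _eachId_ function.
const idOf = (item) => item.id;

// Shape with _each_ alone: an array, one entry per element.
const asArray = items.map(parseItem);

// Shape with _each_ plus _eachId_: an object keyed by the returned id.
const asObject = Object.fromEntries(
    items.map((item) => [idOf(item), parseItem(item)])
);

console.log(asArray);
console.log(asObject);
```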

    For more details, see the following example.

    Let's try this out

    Consider the following html snippet: we will try and fetch some information.

    <h1>Hello, <span>world</span>!</h1>
    <div class="articles">
        <div class="article" data-id="a001">
            <h2>Heading 1</h2>
            <p>Lorem ipsum dolor sit amet, consectetur adipiscing elit.</p>
        </div>
        <div class="article" data-id="a002">
            <h2>Heading 2</h2>
            <p>Donec maximus ipsum quis est tempor, sit amet laoreet libero bibendum.</p>
        </div>
        <div class="article" data-id="a003">
            <h2>Heading 3</h2>
            <p>Suspendisse viverra convallis risus, vitae molestie est tincidunt eget.</p>
        </div>
    </div>
    <footer>
        <p>&copy; <span>Company</span> 2017</p>
    </footer>

    With that snippet stored in the html variable:

    const htmlMiner = require('html-miner');
    
    let json = htmlMiner(html, {
        title: 'h1',
        who: 'h1 span',
        h2: 'h2',
        articlesArray: {
            _each_: '.articles .article',
            title: 'h2',
            content: 'p',
        },
        articlesObject: {
            _each_: '.articles .article',
            _eachId_: function(arg) {
                return arg.$scope.data('id');
            },
            title: 'h2',
            content: 'p',
        },
        footer: {
            _container_: 'footer',
            copyright: (arg) => { return arg.$scope.text().trim(); },
            company: 'span',
            year: (arg) => { return arg.scopeData.copyright.match(/[0-9]+/)[0]; },
        },
        greet: () => { return 'Hi!'; }
    });
    
    console.log( json );
    
    //=> {
    //     title: 'Hello, world!',
    //     who: 'world',
    //     h2: ['Heading 1', 'Heading 2', 'Heading 3'],
    //     articlesArray: [
    //         {
    //             title: 'Heading 1',
    //             content: 'Lorem ipsum dolor sit amet, consectetur adipiscing elit.',
    //         },
    //         {
    //             title: 'Heading 2',
    //             content: 'Donec maximus ipsum quis est tempor, sit amet laoreet libero bibendum.',
    //         },
    //         {
    //             title: 'Heading 3',
    //             content: 'Suspendisse viverra convallis risus, vitae molestie est tincidunt eget.',
    //         }
    //     ],
    //     articlesObject: {
    //         'a001': {
    //             title: 'Heading 1',
    //             content: 'Lorem ipsum dolor sit amet, consectetur adipiscing elit.',
    //         },
    //         'a002': {
    //             title: 'Heading 2',
    //             content: 'Donec maximus ipsum quis est tempor, sit amet laoreet libero bibendum.',
    //         },
    //         'a003': {
    //             title: 'Heading 3',
    //             content: 'Suspendisse viverra convallis risus, vitae molestie est tincidunt eget.',
    //         }
    //     },
    //     footer: {
    //         copyright: '© Company 2017',
    //         company: 'Company',
    //         year: '2017'
    //     },
    //     greet: 'Hi!'
    //   }
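
    Note that the year function above calls match(...)[0], which throws a TypeError when the copyright text contains no digits, because String.prototype.match then returns null. A more defensive variant might look like the sketch below. safeYear is a hypothetical name, returning null when nothing matches is a design choice made here, and the literal objects stand in for the argument html-miner would pass.

```javascript
// Hypothetical defensive variant of the "year" function.
const safeYear = (arg) => {
    // Fall back to an empty string if "copyright" was not fetched.
    const text = arg.scopeData.copyright || '';
    const match = text.match(/[0-9]+/);
    // match is null when the text contains no digits.
    return match ? match[0] : null;
};

// Stand-ins for the argument object (illustrative only).
console.log(safeYear({ scopeData: { copyright: '© Company 2017' } })); //=> 2017
console.log(safeYear({ scopeData: { copyright: '© Company' } }));      //=> null
```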

    You can find other examples in the /examples folder.

    # you can test examples with nodejs
    node examples/demo.js
    node examples/site.js

    Development

    npm install
    npm test
    
    # start the playground locally
    npm start

    Version: 4.0.0

    License: MIT

    Unpacked Size: 368 kB

    Total Files: 6

    Collaborators: marcomontalbano