# Scraper
## About
This tool uses [himalaya][himalaya] under the hood to parse the HTML string into JSON; this package then walks through the result to find the desired information.
## License
Copyright © 2024 DisQada
This tool is licensed under the Apache 2.0 License.
See the LICENSE file for more information.
## Getting Started
### Basic information
As the package [himalaya] breaks its JSON output down into nodes, this package follows the same concept (HTML tag = JSON object/node). The main node types are:

- Element node: container of the main information defining the tag, like the tag name, attributes and children nodes
- Text node: container of the text value inside an HTML tag
- Comment node: container of the content of an HTML comment

Click on a node type's link to read further details about it.
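As an illustration, here is roughly what the three node types look like in himalaya's JSON output (a sketch based on himalaya's documented format; the exact field names come from that package, not from this one):

```js
// Sketch of himalaya-style output for: <p id="x">Hi<!-- note --></p>
const elementNode = {
  type: 'element',
  tagName: 'p',
  attributes: [{ key: 'id', value: 'x' }],
  children: [
    { type: 'text', content: 'Hi' }, // Text node: holds the tag's text value
    { type: 'comment', content: ' note ' } // Comment node: holds the comment's content
  ]
}

console.log(elementNode.children[0].content) // 'Hi'
```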
### Usage
First and most importantly, we need to have our HTML string ready in a variable.

If you're going to use the same HTML string multiple times, it's better to parse it yourself and then pass the JSON output to the functions, so the HTML string is parsed only once. The following example shows how:
```js
const { parse } = require('himalaya')

// Use another package to fetch the HTML from the web
const html = `
<html>
  <head>
    <title>Test Page</title>
  </head>
  <body>
    <h1 id="title">Hello, world!</h1>
    <p class="content">This is a test paragraph.</p>
  </body>
</html>
`

const nodes = parse(html)
// Rest of the code ...
```
#### A full node
```js
const { findNode } = require('@disqada/scraper')

const node = findNode(nodes, {
  tag: 'h1',
  attr: {
    key: 'id',
    value: 'title'
  }
})
// node = {
//   type: 'element',
//   tagName: 'h1',
//   attributes: [{ key: 'id', value: 'title' }],
//   children: []
// }
```
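For intuition, a search like this can be thought of as a recursive walk over himalaya's node tree. The sketch below is a simplified, hypothetical illustration of that idea (`findFirst` is a made-up name; it is not the package's actual implementation):

```js
// Hypothetical simplified findNode-style search over himalaya-shaped
// nodes: returns the first element matching a tag name and attribute pair.
function findFirst (nodes, { tag, attr }) {
  for (const node of nodes) {
    if (node.type !== 'element') continue
    const tagMatches = !tag || node.tagName === tag
    const attrMatches =
      !attr ||
      (node.attributes || []).some(
        (a) => a.key === attr.key && a.value === attr.value
      )
    if (tagMatches && attrMatches) return node
    // Descend into children before moving to the next sibling
    const found = findFirst(node.children || [], { tag, attr })
    if (found) return found
  }
  return null
}

// Usage with a hand-written himalaya-shaped tree:
const tree = [
  {
    type: 'element',
    tagName: 'body',
    attributes: [],
    children: [
      {
        type: 'element',
        tagName: 'h1',
        attributes: [{ key: 'id', value: 'title' }],
        children: [{ type: 'text', content: 'Hello, world!' }]
      }
    ]
  }
]

const node = findFirst(tree, { tag: 'h1', attr: { key: 'id', value: 'title' } })
console.log(node.children[0].content) // 'Hello, world!'
```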
#### A text value
```js
const { grabAText } = require('@disqada/scraper')

const text = grabAText(nodes, {
  title: 'p'
})
// text = 'Test Page'
```
The function `grabAText` can be given a `TextOptions` object that specifies some configurations for the search process; note that it's optional. Click on the highlighted `TextOptions` link to read more.
#### An attribute value
```js
const { grabAttr } = require('@disqada/scraper')

const attr = grabAttr(nodes, { tag: 'p' }, 'class')
// attr = 'content'
```
## CLI commands
### download
You can download an HTML file and its parsed JSON file into the `scrap` folder in the root path of your project, outside runtime, by calling the `download` CLI command.
#### Arguments

Note that either `--url` or `--path` must be given.

| Arg name | Required | Description |
| --- | --- | --- |
| `--file` | true | Name of the downloaded HTML and parsed JSON file |
| `--url` | false | Link of the web page |
| `--path` | false | Path of a local HTML file (the HTML file will be copied to the scrap folder) |
#### Examples
```sh
npm explore @disqada/scraper -- npm run download --url='https://example.com/sample' --file='sample'
npm explore @disqada/scraper -- npm run download --path='./samples/v1/index.html' --file='sample1'
```
[himalaya]: https://www.npmjs.com/package/himalaya