
Scraper

Badges

[badges: github, npm, version, monthly downloads, Test, semantic-release]

Table of Contents

  • About
  • License
  • Getting Started
  • Usage
  • CLI commands

About

This tool uses [himalaya][himalaya] under the hood to parse the HTML string into JSON, then this package searches through the result to find the desired information.

Read this for more information

License

Copyright © 2024 DisQada

This tool is licensed under the Apache 2.0 License.
See the LICENSE file for more information.

Getting Started

Basic information

As the package [himalaya] breaks the HTML down into JSON nodes, this package follows the same concept (one HTML tag = one JSON object/node), with the main node types being:

  • Element node: Container of the main information defining the tag, like the tag name, attributes and children nodes
  • Text node: Container of the text value inside an HTML tag
  • Comment node: Container of the content of an HTML comment

Click on the individual node links to read further details about each type.
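
For reference, here is a rough sketch of the three node shapes produced by himalaya's parse function. The element fields match the example outputs later in this document; the text and comment fields are based on himalaya's output format and may differ slightly between versions, and the variable names are purely illustrative:

// Illustrative node shapes only, not part of this package's API
const elementNode = {
  type: 'element',
  tagName: 'h1',
  attributes: [{ key: 'id', value: 'title' }],
  children: [] // nested element/text/comment nodes go here
}

const textNode = {
  type: 'text',
  content: 'Hello, world!'
}

const commentNode = {
  type: 'comment',
  content: ' an HTML comment '
}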

Usage

First and most importantly, we need to have our HTML string ready in a variable.

If you're going to use the same HTML string for multiple lookups, it's better to parse it yourself and then pass the JSON output to the functions (so the HTML string is parsed only once); the following example shows how:

const { parse } = require('himalaya')

// Use another package to fetch the HTML from the web
const html = `
  <html>
    <head>
      <title>Test Page</title>
    </head>
    <body>
      <h1 id="title">Hello, world!</h1>
      <p class="content">This is a test paragraph.</p>
    </body>
  </html>
`
const nodes = parse(html)

// Rest of the code ...
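
If you need to fetch the HTML from the web at runtime, one option (an assumption on our part, not something this package provides) is Node's built-in global fetch. This is a minimal sketch assuming Node 18 or newer, and loadNodes is a hypothetical helper name:

const { parse } = require('himalaya')

// Fetch a page and parse it once, reusing the nodes for all later lookups
async function loadNodes(url) {
  const res = await fetch(url) // global fetch, available in Node 18+
  const html = await res.text() // raw HTML string
  return parse(html) // himalaya JSON nodes
}

// loadNodes('https://example.com').then((nodes) => { /* ... */ })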

A full node

const { findNode } = require('@disqada/scraper')

const node = findNode(nodes, {
  tag: 'h1',
  attr: {
    key: 'id',
    value: 'title'
  }
})

// node = {
//   type: 'element',
//   tagName: 'h1',
//   attributes: [{ key: 'id', value: 'title' }],
//   children: []
// }

A text value

const { grabAText } = require('@disqada/scraper')

const text = grabAText(nodes, {
  tag: 'title'
})

// text = 'Test Page'

TextOptions

The function grabText can be given a TextOptions object that specifies some configuration for the search process; note that it's optional.

Click on the blue-highlighted TextOptions to read more.

An attribute value

const { grabAttr } = require('@disqada/scraper')

const attr = grabAttr(nodes, {
  tag: 'p',
}, 'class')

// attr = 'content'

CLI commands

download

You can download an HTML file and its parsed JSON file into the scrap folder in the root path of your project, outside runtime, by calling the download CLI command.

Arguments

Note that either --url or --path must be given

Arg name  Required  Description
--file    true      Name of the downloaded HTML file and its parsed JSON file
--url     false     Link of the web page
--path    false     Path of a local HTML file (the HTML file will be copied to the scrap folder)

Examples

npm explore @disqada/scraper -- npm run download --url='https://example.com/sample' --file='sample'
npm explore @disqada/scraper -- npm run download --path='./samples/v1/index.html' --file='sample1'

[himalaya]: https://www.npmjs.com/package/himalaya
