# Scraper
## About
This tool uses [himalaya][himalaya] under the hood to parse the HTML string into JSON; this package then walks through the result to find the desired information.
## License
Copyright © 2024 DisQada
This tool is licensed under the Apache 2.0 License.
See the LICENSE file for more information.
## Getting Started
### Basic information
As the package [himalaya] breaks its JSON output down into nodes, this package follows the same concept (HTML tag = JSON object/node). The main node types are:

- Element node: container of the main information defining the tag, like the tag name, attributes and children nodes
- Text node: container of the text value inside an HTML tag
- Comment node: container of the content of an HTML comment

Click on a node type's link to read further details about it.
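As an illustration, here is roughly what the three node types look like in himalaya's JSON output (a sketch based on himalaya's documented format; the exact field names come from that package, not from this one):

```js
// Sketch of himalaya-style output for: <p id="x">Hi<!-- note --></p>
const elementNode = {
  type: 'element',
  tagName: 'p',
  attributes: [{ key: 'id', value: 'x' }],
  children: [
    { type: 'text', content: 'Hi' }, // Text node: holds the tag's text value
    { type: 'comment', content: ' note ' } // Comment node: holds the comment's content
  ]
}

console.log(elementNode.children[0].content) // 'Hi'
```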
### Usage
First and most importantly, we need to have our HTML string ready in a variable.

If you're going to use the same HTML string multiple times, it's better to parse it yourself and then pass the JSON output to the functions, so the HTML string is parsed only once. The following example shows how:
```js
const { parse } = require('himalaya')

// Use another package to fetch the HTML from the web
const html = `
<html>
  <head>
    <title>Test Page</title>
  </head>
  <body>
    <h1 id="title">Hello, world!</h1>
    <p class="content">This is a test paragraph.</p>
  </body>
</html>
`

const nodes = parse(html)
// Rest of the code ...
```
#### A full node
```js
const { findNode } = require('@disqada/scraper')

const node = findNode(nodes, {
  tag: 'h1',
  attr: {
    key: 'id',
    value: 'title'
  }
})
// node = {
//   type: 'element',
//   tagName: 'h1',
//   attributes: [{ key: 'id', value: 'title' }],
//   children: []
// }
```
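For intuition, a search like this can be thought of as a recursive walk over himalaya's node tree. The sketch below is a simplified, hypothetical illustration of that idea (`findFirst` is a made-up name; it is not the package's actual implementation):

```js
// Hypothetical simplified findNode-style search over himalaya-shaped
// nodes: returns the first element matching a tag name and attribute pair.
function findFirst (nodes, { tag, attr }) {
  for (const node of nodes) {
    if (node.type !== 'element') continue
    const tagMatches = !tag || node.tagName === tag
    const attrMatches =
      !attr ||
      (node.attributes || []).some(
        (a) => a.key === attr.key && a.value === attr.value
      )
    if (tagMatches && attrMatches) return node
    // Descend into children before moving to the next sibling
    const found = findFirst(node.children || [], { tag, attr })
    if (found) return found
  }
  return null
}

// Usage with a hand-written himalaya-shaped tree:
const tree = [
  {
    type: 'element',
    tagName: 'body',
    attributes: [],
    children: [
      {
        type: 'element',
        tagName: 'h1',
        attributes: [{ key: 'id', value: 'title' }],
        children: [{ type: 'text', content: 'Hello, world!' }]
      }
    ]
  }
]

const node = findFirst(tree, { tag: 'h1', attr: { key: 'id', value: 'title' } })
console.log(node.children[0].content) // 'Hello, world!'
```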
#### A text value
```js
const { grabAText } = require('@disqada/scraper')

const text = grabAText(nodes, {
  title: 'p'
})
// text = 'Test Page'
```
The function `grabAText` can be given a `TextOptions` object that specifies some configurations for the search process; note that it's optional. Click on the highlighted `TextOptions` link to read more.
#### An attribute value
```js
const { grabAttr } = require('@disqada/scraper')

const attr = grabAttr(nodes, { tag: 'p' }, 'class')
// attr = 'content'
```
## CLI commands
### download
You can download an HTML file and its parsed JSON file into the `scrap` folder in the root path of your project, outside runtime, by calling the `download` CLI command.
#### Arguments

Note that either `--url` or `--path` must be given.

| Arg name | Required | Description |
| --- | --- | --- |
| `--file` | true | Name of the downloaded HTML and parsed JSON file |
| `--url` | false | Link of the web page |
| `--path` | false | Path of a local HTML file (the HTML file will be copied to the scrap folder) |
#### Examples
```sh
npm explore @disqada/scraper -- npm run download --url='https://example.com/sample' --file='sample'
npm explore @disqada/scraper -- npm run download --path='./samples/v1/index.html' --file='sample1'
```
[himalaya]: https://www.npmjs.com/package/himalaya