Als-Scraper
If something is wrong or not working properly, please write to me at: sh.mashkanta@gmail.com
Als-scraper is a library with 3 classes:
- Document - takes HTML text, builds a DOM tree, and lets you query elements (similar to Cheerio, but different)
- Request - sends web requests (get/post/put/delete)
- Scraper - sends a request (with Request) and grabs elements or their properties (with Document)
What's new?
- classList methods (add, remove)
- id is not included in attributes
- new method build(filePath) for building HTML from the DOM tree
- Some minor fixes
Document
Document is a class that takes HTML as a string and returns a new object with a DOM tree. Using this DOM tree, a Document object can select elements and their properties, either as collections or as single elements. Collections and elements have additional methods for selecting.
Creating new object
const { readFileSync } = require("fs")
let html = readFileSync('test.html','utf-8')
let {Document} = require('als-scraper')
let document = new Document(html)
QuerySelector for single element
Once the document object has been created, you can select elements or collections.
To select a single element, use $(selector); to select a collection, use $$(selector).
Selecting element
document.$('div') // select first div in document
document.$('div.some') // select first div element with some class
At this time, the selector supports the following:
- all elements - *
- element - div
- class - .some-class
- id - #some-id
- attribute - [some-attribute="some value"], [prop], [prop~=value], [prop|=value], [prop^="value"], [prop$="value"], [prop*="value"]
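The attribute operators listed above follow the usual CSS meanings. As a rough illustration of what each one tests, here is a minimal sketch of those standard semantics. The function name matchesAttribute is hypothetical and this is not als-scraper's own code; it also ignores the CSS edge cases for empty values.

```javascript
// Illustrative only: standard CSS attribute-operator semantics,
// not als-scraper's actual implementation. matchesAttribute is a hypothetical name.
function matchesAttribute(actual, operator, expected) {
  if (actual === undefined) return false;            // attribute is absent
  switch (operator) {
    case undefined: return true;                     // [prop]: attribute exists
    case '=':  return actual === expected;           // exact match
    case '~=': return actual.split(/\s+/).includes(expected); // one of the space-separated words
    case '|=': return actual === expected || actual.startsWith(expected + '-'); // value or value-*
    case '^=': return actual.startsWith(expected);   // prefix
    case '$=': return actual.endsWith(expected);     // suffix
    case '*=': return actual.includes(expected);     // substring
    default:   throw new Error('Unknown operator: ' + operator);
  }
}

console.log(matchesAttribute('btn btn-primary', '~=', 'btn')); // true
console.log(matchesAttribute('en-US', '|=', 'en'));            // true
console.log(matchesAttribute('header-top', '^=', 'head'));     // true
console.log(matchesAttribute('header-top', '$=', 'bottom'));   // false
```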
Selectors that combine multiple elements are not supported yet (planned for a future version). This means the following won't work:
div p
div > p
div + p
p ~ ul
Each returned element has the following:
element = {
parent, // parent element
prev, // previous element (null if none exists)
next, // next element (null if none exists)
innerText, // inner text of the element and its childNodes, separated by |
children, // array of childNodes (elements and text nodes)
tagName, // tag name of element
id, // id of element if exists
attributes, // object of attributes (id not included)
classList, // array of classes and add and remove methods
$(selector),
$$(selector),
}
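Because combinators such as div > p and div p are not supported, the parent property can stand in for them. The sketch below uses hand-built objects in place of real elements, so it runs without the library. The helper names childOf and descendantOf are hypothetical, and it assumes tag names are stored in lowercase and that the root element has no parent.

```javascript
// Hand-built stand-ins for elements; real ones would come from document.$$('p').
// Assumption: tagName is lowercase and the root element's parent is null.
const div = { tagName: 'div', parent: null };
const section = { tagName: 'section', parent: null };
const span = { tagName: 'span', parent: div };
const paragraphs = [
  { tagName: 'p', id: 'a', parent: div },
  { tagName: 'p', id: 'b', parent: section },
  { tagName: 'p', id: 'c', parent: span },
];

// Emulates "div > p": keep elements whose direct parent has the given tag.
const childOf = (elements, parentTag) =>
  elements.filter(el => el.parent && el.parent.tagName === parentTag);

// Emulates "div p": walk up the parent chain looking for a matching ancestor.
const descendantOf = (elements, ancestorTag) =>
  elements.filter(el => {
    for (let node = el.parent; node; node = node.parent) {
      if (node.tagName === ancestorTag) return true;
    }
    return false;
  });

console.log(childOf(paragraphs, 'div').map(el => el.id));      // [ 'a' ]
console.log(descendantOf(paragraphs, 'div').map(el => el.id)); // [ 'a', 'c' ]
```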
A text node has the following:
textElement = {
text,
prev,
next
}
You can add or remove classes with classList methods. Example:
let element = document.$('div')
element.classList.remove('some')
element.classList.add('another')
element.classList.add('onemore')
You can also change an element's id:
let element = document.$('div')
element.id = 'new-id'
$$()
QuerySelector for collection
To select several elements, use the $$(selector) method.
document.$$('div') // return collection of all div elements
The collection is an array that holds the elements and has two methods: each and parse.
The each method takes a callback function with 3 parameters: the element itself, the index of the element in the collection, and the collection itself.
Here is an example:
let array = []
document.$$('div').each((element,index,collection) => {
if(element.innerText.includes('some text'))
array.push(element)
})
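To make the callback signature concrete without loading the library, the following sketch attaches an each method to a plain array. The helper toCollection is hypothetical and only mimics the behavior described above; the real collection may be built differently.

```javascript
// Hypothetical helper: gives a plain array an each method with the
// (element, index, collection) callback signature described above.
function toCollection(elements) {
  const collection = [...elements];
  collection.each = function (callback) {
    this.forEach((element, index) => callback(element, index, this));
  };
  return collection;
}

// Stand-in elements; real ones would come from document.$$('div').
const divs = toCollection([
  { innerText: 'some text here' },
  { innerText: 'other' },
]);

const matches = [];
divs.each((element, index, collection) => {
  if (element.innerText.includes('some text')) matches.push(index);
});
console.log(matches); // [ 0 ]
```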
The parse method takes two parameters, part and fn, and returns an array with the results.
- part is a part of the element. It can be innerText, id, tagName, or any property inside attributes.
- fn is a filter function that receives the content of part. If it returns true, the content is included.
Example:
new Document(htmlText).$$('div')
  .parse('innerText', content => content.length > 0)
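The lookup rule for part (innerText, id, tagName, otherwise a key inside attributes) can be sketched on plain objects as follows. The names readPart and parseCollection are hypothetical. Treating fn as optional and skipping elements that lack the requested part are assumptions, not documented behavior.

```javascript
// Hypothetical stand-in for parse: resolves `part` on each element,
// then keeps the values for which the filter function returns true.
function readPart(element, part) {
  if (part === 'innerText' || part === 'id' || part === 'tagName') return element[part];
  return element.attributes ? element.attributes[part] : undefined;
}

// Assumptions: fn defaults to "keep everything"; missing parts are skipped.
function parseCollection(elements, part, fn = () => true) {
  const results = [];
  for (const element of elements) {
    const content = readPart(element, part);
    if (content !== undefined && fn(content)) results.push(content);
  }
  return results;
}

const links = [
  { tagName: 'a', innerText: 'Home', attributes: { href: '/' } },
  { tagName: 'a', innerText: '', attributes: { href: '/about' } },
  { tagName: 'a', innerText: 'Docs', attributes: {} },
];

console.log(parseCollection(links, 'innerText', content => content.length > 0)); // [ 'Home', 'Docs' ]
console.log(parseCollection(links, 'href', content => content.startsWith('/'))); // [ '/', '/about' ]
```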
Building html
To build the HTML again, use the build method.
Example:
let element = document.$('div')
element.classList.add('another')
element.classList.remove('some')
element.id = 'new-id'
document.build() // returns the new html text
document.build([__dirname,'new-index.html']) // creates a file with the new html text
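For intuition about what building HTML from a DOM tree involves, here is a minimal serializer over plain objects shaped like the elements and text nodes described earlier. It is a sketch, not the library's build method: the function serialize and the VOID_TAGS list are hypothetical, classes are assumed to live only in classList, and no escaping of text or attribute values is done.

```javascript
// Hypothetical serializer, not als-scraper's build(). No escaping is performed.
// Assumption: classes live in classList and id is a separate property.
const VOID_TAGS = new Set(['br', 'hr', 'img', 'input', 'link', 'meta']);

function serialize(node) {
  if (node.tagName === undefined) return node.text;  // text node: emit its text
  const attrs = [];
  if (node.id) attrs.push(`id="${node.id}"`);
  if (node.classList && node.classList.length) {
    attrs.push(`class="${node.classList.join(' ')}"`);
  }
  for (const [name, value] of Object.entries(node.attributes || {})) {
    attrs.push(`${name}="${value}"`);
  }
  const open = [node.tagName, ...attrs].join(' ');
  if (VOID_TAGS.has(node.tagName)) return `<${open}>`; // void tags have no closing tag
  const inner = (node.children || []).map(serialize).join('');
  return `<${open}>${inner}</${node.tagName}>`;
}

const tree = {
  tagName: 'div', id: 'new-id', classList: ['another'], attributes: { title: 'demo' },
  children: [
    { text: 'Hello ' },
    { tagName: 'b', classList: [], attributes: {}, children: [{ text: 'world' }] },
  ],
};

console.log(serialize(tree));
// <div id="new-id" class="another" title="demo">Hello <b>world</b></div>
```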
Request
Request is a class that sends web requests. It has a constructor and 4 request methods:
new Request(url)
.get(fn)
.post(fn,data='')
.put(fn,data)
.delete(fn,data)
// fn = function(data/error,statusCode)
- url must include http:// or https://
- fn is a function that receives 2 parameters: the response data or an error, and the status code.
- data must be a string; it is the data to send for the post/put/delete methods
Example:
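let {Request} = require('als-scraper')
new Request('http://www.columbia.edu/~fdc/sample.html').get((data, statusCode) => {
  console.log(statusCode, data) // data is the response data, or the error
})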
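Since the url must start with http:// or https://, it can help to check it before creating a Request. The sketch below uses Node's built-in global URL class. The helper name hasHttpScheme is hypothetical and is not part of als-scraper.

```javascript
// Hypothetical pre-check, not part of als-scraper: verifies that a string
// is an absolute URL using the http or https scheme.
function hasHttpScheme(url) {
  try {
    const { protocol } = new URL(url); // throws on strings without a scheme
    return protocol === 'http:' || protocol === 'https:';
  } catch {
    return false; // not an absolute URL
  }
}

console.log(hasHttpScheme('http://www.columbia.edu/~fdc/sample.html')); // true
console.log(hasHttpScheme('www.columbia.edu/~fdc/sample.html'));        // false
console.log(hasHttpScheme('ftp://example.com/file'));                   // false
```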
Scraper
Scraper.parse(url,selector,fn,part) // fn(data/error,statusCode)
Scraper.write(url,selector,filePath,part)
Example:
let {Scraper} = require('als-scraper')
let url = 'http://www.columbia.edu/~fdc/sample.html'
let selector = 'div'
let pathForFile = 'example.json'
let part = 'innerText'
Scraper.writeHtml(url,selector,pathForFile,part)
Scraper.parseHtml(url,selector,function(data,status) {
console.log(data,status)
},part)