als-scraper

0.2.1 • Public • Published

Als-Scraper

If something is wrong or not working properly, please write to me at: sh.mashkanta@gmail.com

Als-scraper is a library with 3 classes:

  1. Document - gets html text, builds a DOM tree and allows querying elements (similar to Cheerio, but different)
  2. Request - sends web requests (get/post/put/delete)
  3. Scraper - sends a request (with Request) and grabs elements or their properties (with Document)

What's new?

  • classList methods (add, remove)
  • id is no longer included in attributes
  • new build(filePath) method for building html from the DOM tree
  • Some minor fixes

Document

Document is a class which gets html as a string and returns a new object with a DOM tree. Thanks to this DOM tree, the Document object can select elements and their properties, either inside collections or as single elements. Collections and elements have additional methods for selecting.

Creating new object

const { readFileSync } = require("fs")
let html = readFileSync('test.html','utf-8')

let {Document} = require('als-scraper')
let document = new Document(html)

QuerySelector for single element

Once the document object has been created, you can select elements or collections. To select a single element, use $(selector); to select a collection, use $$(selector).

Selecting element

document.$('div') // select first div in document
document.$('div.some') // select first div element with some class

At this time, the selector supports the following:

  • all elements - *
  • element - div
  • class - .some-class
  • id - #some-id
  • attribute - [some-attribute="some value"]
    • [prop]
    • [prop~=value]
    • [prop|=value]
    • [prop^="value"]
    • [prop$="value"]
    • [prop*="value"]

Multiple-element (combinator) selectors are not supported right now (planned for the next versions). This means the following won't work:

  • div p
  • div > p
  • div + p
  • p ~ ul

Each returned element has the following properties:

element = {
   parent, // parent element
   prev, // previous element (null if none exists)
   next, // next element (null if none exists)
   innerText, // inner text of the element and its childNodes, separated by |
   children, // array of childNodes (elements and text nodes) - includes text nodes too
   tagName, // tag name of the element
   id, // id of the element if it exists
   attributes, // object of attributes (id not included)
   classList, // array of classes with add and remove methods
   $(selector),
   $$(selector),
}

A text node has the following:

textElement = {
   text,
   prev,
   next
}

You can add or remove classes with classList methods. Example:

let element = document.$('div')
element.classList.remove('some')
element.classList.add('another')
element.classList.add('onemore')

Also you can change element's id:

let element = document.$('div')
element.id = 'new-id'

QuerySelector for Collection $$()

To select multiple elements, use the $$(selector) method.

document.$$('div') // returns a collection of all div elements

The collection is an array which holds the elements and has two additional methods: each and parse.

The each method takes a callback function with 3 parameters: the element itself, the index of the element in the collection, and the collection itself.

Here is an example:

let array = []
document.$$('div').each((element,index,collection) => {
   if(element.innerText.includes('some text'))
      array.push(element)
})

The parse method takes two parameters, part and fn, and returns an array with the results.

  • part is a part of the element. It can be innerText, id, tagName or any property inside attributes.
  • fn is a filter function which gets the content of part. If it returns true, the content will be included.

Example:

new Document(htmlText).$$('div')
.parse('innerText', content => content.length > 0)

Building html

To build the html again, use the build method. Example:

let element = document.$('div')
element.classList.add('another')
element.classList.remove('some')
element.id = 'new-id'
document.build() // returns the new html text
document.build([__dirname,'new-index.html']) // will create a file with the new html text

Request

Request is a class which sends web requests. It has a constructor and 4 request methods:

new Request(url)
.get(fn)
.post(fn,data='')
.put(fn,data)
.delete(fn,data)
// fn = function(data/error,statusCode)
  • the url parameter has to include http:// or https://
  • fn is a function which gets 2 parameters: the response's data or an error, and the status code.
  • data has to be a string to send in the case of the post/put/delete methods

Example:

let {Request} = require('als-scraper')
new Request('https://example.com')
.get((dataOrError, statusCode) => {
   console.log(statusCode, dataOrError)
})

Scraper

Scraper is a class with two static methods:

Scraper.parse(url,selector,fn,part) // fn(data/error,statusCode)
Scraper.write(url,selector,filePath,part)

Example:

let {Scraper} = require('als-scraper')

let url = 'http://www.columbia.edu/~fdc/sample.html'
let selector = 'div'
let pathForFile = 'example.json'
let part = 'innerText'

Scraper.writeHtml(url,selector,pathForFile,part)

Scraper.parseHtml(url,selector,function(data,status) {
   console.log(data,status)
},part)

Install

npm i als-scraper

License

ISC

Collaborators

  • alexsorkin