xstruct

0.7.9 • Public • Published

A set of tools for extracting structured data from the web.

Installation

npm i xstruct --save

Example

The following example extracts the comments from a dou.ua forum topic.

var $ = require('xstruct');

// Download the topic page; html is the cheerio-wrapped document.
$.getHtml('http://dou.ua/forums/topic/14416/')
    .then(function (html) {
        // Collect the raw fields of every comment block on the page.
        return html('.b-comment').map(function () {
            var el = $.wrapHtml(this);
            return {
                author: el.find('.avatar').text(),
                time: el.find('.comment-link').text(),
                text: el.find('.text').contents().map(function () {
                    return $.wrapHtml(this).text();
                }).get()
            };
        }).toArray();
    })
    // Clean each raw comment (bluebird's map over the resolved array).
    .map(function (post) {
        return {
            author: $.cleanText(post, 'author'),
            time: $.cleanText(post, 'time'),
            text: $.cleanText(post, 'text', { singleline: true })
        };
    })
    // Print the cleaned comments, or the error if any step failed.
    .done(console.log, console.log);

Description

getHtml(url[, qs][, encoding])

Returns a promise that resolves to the downloaded HTML wrapped in cheerio. If encoding is specified, the document is converted before it is passed to cheerio. If qs (a query string object) is specified, the query string is appended to url.
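As a rough illustration of what appending qs to url amounts to, the sketch below builds the final URL with Node's built-in URLSearchParams. The appendQuery helper is hypothetical and not part of xstruct; the library presumably leaves this step to request, whose exact encoding rules may differ.

```javascript
// Sketch only: appendQuery is a hypothetical helper, not an xstruct function.
function appendQuery(url, qs) {
    if (!qs || Object.keys(qs).length === 0) {
        return url;
    }
    var query = new URLSearchParams(qs).toString();
    // Use '&' when the URL already carries a query string.
    return url + (url.indexOf('?') === -1 ? '?' : '&') + query;
}

console.log(appendQuery('http://example.com/search', { q: 'scraping', page: 2 }));
```

With the real library, the same object would simply be passed as the second argument of getHtml or getJson.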

getJson(url[, qs])

Returns a promise that resolves to the downloaded and parsed JSON. If qs (a query string object) is specified, the query string is appended to url.

postForm(url, form)

Returns a promise that resolves to the result of posting the form. Activates cookie persistence.

request(options)

Promise-returning version of the request.js root function.

wrapHtml(cheerioElement)

Calls cheerio(cheerioElement) and returns the result synchronously.

format

Alias for util.format.
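Because format is described as an alias for util.format, Node's own function shows the expected behaviour. The sketch below calls util.format directly; the URL is the one from the example above.

```javascript
var util = require('util');

// %d is replaced by the numeric argument.
var topicUrl = util.format('http://dou.ua/forums/topic/%d/', 14416);

console.log(topicUrl); // http://dou.ua/forums/topic/14416/
```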

cleanText(obj, path[, options])

Takes text from obj using path and cleans it: removes leading and trailing spaces, collapses repeated spaces and repeated periods, converts the text to a single line if options.singleline is specified, and removes any characters listed in options.remove (if specified). Returns null if the result is an empty string or there is no value.
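The cleaning rules above can be sketched in plain JavaScript. This is a hypothetical re-implementation for illustration, not xstruct's actual code: it assumes that path is a dot-separated key path, that options.remove is a string of characters, and that the value is a string.

```javascript
// Hypothetical sketch of the described rules; the real cleanText may differ.
function cleanTextSketch(obj, path, options) {
    options = options || {};
    // Resolve a dot-separated path such as 'author.name' (assumed syntax).
    var value = path.split('.').reduce(function (current, key) {
        return current == null ? current : current[key];
    }, obj);
    if (value == null) {
        return null;
    }
    var text = String(value);
    if (options.remove) {
        // Drop every character listed in options.remove (assumed to be a string).
        var unwanted = options.remove.split('');
        text = text.split('').filter(function (ch) {
            return unwanted.indexOf(ch) === -1;
        }).join('');
    }
    if (options.singleline) {
        // Replace line breaks with spaces so the text fits on one line.
        text = text.replace(/[\r\n]+/g, ' ');
    }
    text = text
        .replace(/[ \t]{2,}/g, ' ') // collapse repeated spaces
        .replace(/\.{2,}/g, '.')    // collapse repeated periods
        .trim();
    return text === '' ? null : text;
}

console.log(cleanTextSketch({ author: '  John   Doe.. ' }, 'author')); // John Doe.
```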

cleanNumber(obj, path)

Acts like cleanText, but casts the result to a number at the end. Returns null if the result is not a number.
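The number cast can be sketched as follows. The helper is hypothetical and simplified: it reads a top-level key only and merely trims the text, whereas the real function applies the full cleanText rules first.

```javascript
// Hypothetical sketch; not xstruct's implementation.
function cleanNumberSketch(obj, path) {
    var value = obj == null ? undefined : obj[path]; // top-level key only, for brevity
    if (value == null) {
        return null;
    }
    var text = String(value).trim();
    // Number('') is 0, so empty text has to be rejected before the cast.
    if (text === '') {
        return null;
    }
    var number = Number(text);
    return isNaN(number) ? null : number;
}

console.log(cleanNumberSketch({ votes: ' 42 ' }, 'votes')); // 42
console.log(cleanNumberSketch({ votes: 'n/a' }, 'votes'));  // null
```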

cleanDateTime(obj, path[, options])

Acts like cleanText, but converts the result to a date at the end (using moment.js). Returns null if the result is not a valid date. The date-time format can optionally be specified via options.format.

cleanObject(obj)

Returns the object as is, or null if none of its properties has a value.
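A minimal sketch of this check, assuming that "has a value" means neither null nor undefined (the library may use a different test). cleanObjectSketch is a hypothetical name.

```javascript
// Hypothetical sketch; not xstruct's implementation.
function cleanObjectSketch(obj) {
    var hasValue = Object.keys(obj).some(function (key) {
        return obj[key] !== null && obj[key] !== undefined;
    });
    return hasValue ? obj : null;
}

console.log(cleanObjectSketch({ author: null, text: null }));    // null
console.log(cleanObjectSketch({ author: 'John', text: null }));  // { author: 'John', text: null }
```

A check like this pairs naturally with the clean* functions, which return null for empty fields: a record whose fields all came back empty collapses to null and can be filtered out.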

_.*

Exposes all functions from lodash.

limit(requests, period)

Limits the library to at most requests HTTP requests per period, with period given in milliseconds.
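One way such a limit could work is a sliding window over request start times. The sketch below is a hypothetical illustration of that technique, not xstruct's actual mechanism; createLimiter and reserve are invented names, and the clock value is passed in explicitly so the behaviour is easy to follow.

```javascript
// Hypothetical sliding-window limiter; not xstruct's implementation.
function createLimiter(requests, period) {
    var startTimes = []; // scheduled start times of the last `requests` requests
    // Returns how many milliseconds a request arriving at `now` has to wait.
    return function reserve(now) {
        var startAt = now;
        if (startTimes.length >= requests) {
            // The request `requests` slots back must have started at least `period` ago.
            startAt = Math.max(now, startTimes[0] + period);
        }
        startTimes.push(startAt);
        if (startTimes.length > requests) {
            startTimes.shift();
        }
        return startAt - now;
    };
}

// At most 2 requests per 1000 ms, with five requests all arriving at time 0.
var reserve = createLimiter(2, 1000);
var delays = [0, 0, 0, 0, 0].map(function (now) {
    return reserve(now);
});
console.log(delays); // [ 0, 0, 1000, 1000, 2000 ]
```

In real use the returned delay would be handed to setTimeout, with Date.now() as the clock value, before the HTTP request is sent.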

Building blocks

This library relies heavily on request, cheerio, lodash and bluebird. It also uses iconv-lite, moment and util as supporting utilities.

License

MIT

Collaborators

  • titarenko