xstruct

Data extraction tools.

A set of tools for extracting structured data from the web.

npm i xstruct --save

The example below extracts comments from a dou.ua forum topic.

var $ = require('xstruct');

// Download the topic page; html is the cheerio-wrapped document.
$.getHtml('http://dou.ua/forums/topic/14416/')
    .then(function (html) {
        // Extract the raw fields of every comment block.
        return html('.b-comment').map(function () {
            var el = $.wrapHtml(this);
            return {
                author: el.find('.avatar').text(),
                time: el.find('.comment-link').text(),
                // Collect the text of each child node of the comment body.
                text: el.find('.text').contents().map(function () {
                    return $.wrapHtml(this).text();
                }).get()
            };
        }).toArray();
    })
    // Clean up the text of each extracted comment.
    .map(function (post) {
        return {
            author: $.cleanText(post, 'author'),
            time: $.cleanText(post, 'time'),
            text: $.cleanText(post, 'text', { singleline: true })
        };
    })
    // Print the cleaned comments, or the error if anything failed.
    .done(console.log, console.log);

$.getHtml returns a promise for the downloaded, cheerio-wrapped HTML. If an encoding is specified, the document is converted from that encoding before it is passed to cheerio. If qs (a query string object) is specified, it is appended to the URL as a query string.

Returns a promise for the downloaded and parsed JSON. If qs (a query string object) is specified, it is appended to the URL as a query string.
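To illustrate what appending a query string object to a URL amounts to, here is a minimal sketch using Node's built-in URL class. The appendQuery helper is hypothetical and not part of xstruct; the library itself may build the final URL differently.

```javascript
// Hypothetical helper (not part of xstruct): shows how the keys of a
// qs object map onto the query string of a URL.
function appendQuery(url, qs) {
    var result = new URL(url);
    Object.keys(qs || {}).forEach(function (key) {
        result.searchParams.append(key, String(qs[key]));
    });
    return result.toString();
}

console.log(appendQuery('http://example.com/search', { q: 'node', page: 2 }));
// prints "http://example.com/search?q=node&page=2"
```

Passing the parameters as an object avoids concatenating and escaping them by hand.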

Returns a promise for the result of posting a form. Activates cookie persistence.

A promise-returning version of the request.js root function.

$.wrapHtml calls cheerio(cheerioElement) and returns the result synchronously.

Alias for util.format.
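As a reminder of what util.format does, the snippet below calls Node's util module directly. The URL and its placeholders are made up for illustration.

```javascript
var util = require('util');

// %s substitutes a string, %d substitutes a number.
var url = util.format('http://example.com/%s/page/%d', 'forums', 3);
console.log(url); // prints "http://example.com/forums/page/3"
```

This is handy for building page URLs inside a scraping loop.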

$.cleanText takes text from an object using a path and cleans it: it removes leading and trailing spaces, collapses repeated spaces and periods, converts the text to a single line if options.singleline is set, and removes every character listed in options.remove (if specified). Returns null if the result is an empty string or nothing.
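The cleaning rules described above can be sketched in plain JavaScript. This is a hypothetical re-implementation for illustration only, not the library's code: cleanTextSketch is an invented name, it accepts a plain property key rather than a full path, it assumes options.remove is a string of characters, and the order in which the rules are applied is a guess.

```javascript
// Hypothetical sketch of the cleaning rules; the real cleanText may
// differ in path resolution, rule order and edge cases.
function cleanTextSketch(obj, key, options) {
    options = options || {};
    var value = obj == null ? null : obj[key];
    if (value == null) {
        return null;
    }
    var text = String(value);
    if (options.remove) {
        // Drop every character listed in options.remove.
        text = text.split('').filter(function (ch) {
            return options.remove.indexOf(ch) === -1;
        }).join('');
    }
    if (options.singleline) {
        // Replace line breaks (and the whitespace around them) with one space.
        text = text.replace(/\s*[\r\n]+\s*/g, ' ');
    }
    text = text
        .replace(/[ \t]{2,}/g, ' ') // collapse repeated spaces
        .replace(/\.{2,}/g, '.')    // collapse repeated periods
        .trim();                    // strip leading and trailing spaces
    return text === '' ? null : text;
}

console.log(cleanTextSketch({ author: '  John   Doe.. ' }, 'author'));
// prints "John Doe."
```

Returning null instead of an empty string lets callers treat a missing field and a blank field the same way.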

Acts like cleanText, but casts the result to a number at the end. Returns null if the result is not a number (NaN).
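The final cast can be pictured as follows. This is a sketch under assumptions: toNumberOrNull is a hypothetical name, it operates on text that has already been cleaned, and the library's actual conversion rules may differ.

```javascript
// Hypothetical sketch: cast already-cleaned text to a number and
// return null instead of NaN.
function toNumberOrNull(text) {
    // Number('') is 0, so treat empty input as "no value" explicitly.
    if (text == null || String(text).trim() === '') {
        return null;
    }
    var n = Number(text);
    return isNaN(n) ? null : n;
}

console.log(toNumberOrNull('42.5')); // prints 42.5
console.log(toNumberOrNull('n/a'));  // prints null
```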

Acts like cleanText, but casts the result to a date at the end (using moment.js). Returns null if the result is not a valid date. You can optionally specify the date-time format via options.format.

Returns the object as is, or null if none of its properties has a value.
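A minimal sketch of this behavior, assuming that "no value" means null or undefined (the library may also treat other values as empty). The objectOrNull name is hypothetical.

```javascript
// Hypothetical sketch: return the object unchanged if at least one
// property holds a value, otherwise return null.
function objectOrNull(obj) {
    var hasValue = Object.keys(obj || {}).some(function (key) {
        return obj[key] !== null && obj[key] !== undefined;
    });
    return hasValue ? obj : null;
}

console.log(objectOrNull({ author: null, time: null })); // prints null

var post = { author: 'Ann', time: null };
console.log(objectOrNull(post) === post); // prints true
```

This makes it easy to drop records for which every extracted field came back empty.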

Exposes all functions from lodash.

This library relies heavily on request, cheerio, lodash and bluebird. It also uses iconv-lite, moment and util as additional utilities.

License

MIT