xpath-html

    1.0.2 • Public • Published

    XPath HTML


    XPath stands for XML Path Language.
    It provides a flexible non-XML syntax to address (point to) different parts of an XML document.


    XPath HTML brings this powerful tool to HTML, letting you navigate the HTML DOM with XPath expressions.


    If you want to learn more about XPath and how to use different XPath expressions to find complex or dynamic elements, have a look at this concise tutorial.

    Installation

    xpath-html is available as a package on NPM. Open a terminal and enter the following command:

    npm install --save xpath-html

    Usage

    Hello XPath from HTML World

    const fs = require("fs");
    const xpath = require("xpath-html");
     
    // Assuming you have an HTML file locally.
    // This one contains the content scraped from www.shopback.sg.
    const html = fs.readFileSync(`${__dirname}/shopback.html`, "utf8");
     
    // The input does not have to come from a file:
    // the HTML response of an HTTP request works too,
    // as long as the argument is a string.
    const node = xpath.fromPageSource(html).findElement("//*[contains(text(), 'with love')]");
     
    console.log(`The matched tag name is "${node.getTagName()}"`);
    console.log(`Your full text is "${node.getText()}"`);

    # A quick way to download the .html file used above
    $ curl https://www.shopback.sg -o shopback.html
     
    # Or from my GitHub examples 
    $ curl -O https://raw.githubusercontent.com/hieuvp/xpath-html/master/examples/shopback.html

    Bang 💥 The output should look something like:

    The matched tag name is "div"
    Your full text is "Made with love by"

    Straightforward, right?
    Now scroll down to the APIs below and dive into the details.


    fromPageSource(html).findElement(expression)

    Locate an element on a page. The returned node is a representation of the underlying DOM.

    Arguments:

    Name        Type    Description
    html        string  Input HTML page's source
    expression  string  The given XPath expression

    Returns: Node

    Example:

    const fs = require("fs");
    const xpath = require("xpath-html");
     
    const html = fs.readFileSync(`${__dirname}/shopback.html`, "utf8");
    const node = xpath.fromPageSource(html).findElement("//*[text()='Made with love by']");
     
    console.log(node.toString());

    Result:

    <div xmlns="http://www.w3.org/1999/xhtml">Made with love by</div>
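The expression in the example above wraps the search text in single quotes, which stops working once the text itself contains an apostrophe. XPath 1.0 string literals have no escape character, so such text has to be quoted differently or assembled with concat(). The helper below is a minimal sketch of that idea; toXPathLiteral is a hypothetical name and is not part of xpath-html.

```javascript
// Build an XPath 1.0 string literal for arbitrary text.
// XPath 1.0 has no escape character inside string literals, so text that
// contains both quote types has to be assembled with concat().
function toXPathLiteral(text) {
  if (!text.includes("'")) {
    return `'${text}'`;
  }
  if (!text.includes('"')) {
    return `"${text}"`;
  }
  // Split on single quotes and re-join the pieces with a double-quoted apostrophe.
  const parts = text.split("'").map((part) => `'${part}'`);
  return `concat(${parts.join(`, "'", `)})`;
}

// Rebuild the expression used in the example above.
const expression = `//*[text()=${toXPathLiteral("Made with love by")}]`;
console.log(expression); // //*[text()='Made with love by']
```

The resulting string can be passed to findElement or findElements like any other expression.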

    fromPageSource(html).findElements(expression)

    Search for multiple elements on a page.

    Arguments:

    Name        Type    Description
    html        string  Input HTML page's source
    expression  string  The given XPath expression

    Returns: Array<Node>

    Example:

    const fs = require("fs");
    const xpath = require("xpath-html");
     
    const html = fs.readFileSync(`${__dirname}/shopback.html`, "utf8");
    const nodes = xpath
      .fromPageSource(html)
      .findElements("//img[starts-with(@src, 'https://cloud.shopback.com')]");
     
    console.log("Number of nodes found:", nodes.length);
    console.log("nodes[0]:", nodes[0].toString());
    console.log("nodes[1]:", nodes[1].toString());

    Result:

    Number of nodes found: 158
    nodes[0]: <img src="https://cloud.shopback.com/raw/upload/static/images/navbar/sb-logo.png" xmlns="http://www.w3.org/1999/xhtml"/>
    nodes[1]: <img src="https://cloud.shopback.com/raw/upload/static/images/navbar/desktop/icon-raf.svg" xmlns="http://www.w3.org/1999/xhtml"/>
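A common next step after findElements is to reduce the matched nodes to one attribute, for example every image URL. The sketch below relies only on the getAttribute(name) method documented further down; collectAttribute is a hypothetical helper name, and the stand-in objects exist only so the snippet runs without the HTML file.

```javascript
// Collect one attribute from every node, dropping empty values and duplicates.
// Each item in `nodes` only needs a getAttribute(name) method.
function collectAttribute(nodes, name) {
  const values = nodes
    .map((node) => node.getAttribute(name))
    .filter((value) => typeof value === "string" && value.length > 0);
  return [...new Set(values)];
}

// Stand-in objects (an assumption for illustration); in real use, pass the
// array returned by findElements() instead.
const standIns = ["sb-logo.png", "icon-raf.svg", "sb-logo.png"].map((src) => ({
  getAttribute: (attr) => (attr === "src" ? src : ""),
}));

console.log(collectAttribute(standIns, "src")); // [ 'sb-logo.png', 'icon-raf.svg' ]
```

With the nodes from the example above, the call would be collectAttribute(nodes, "src").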

    fromNode(xhtml).findElement(expression)

    Select an element from an XHTML fragment.
    Similar to fromPageSource(html).findElement(expression), but it queries a subset of an HTML page instead of the whole page.

    Arguments:

    Name        Type            Description
    xhtml       Node or string  Either a node returned from a query or a well-formed XHTML string
    expression  string          The given XPath expression

    Returns: Node

    Notes:

    • The input xhtml must have a namespace of xmlns="http://www.w3.org/1999/xhtml"
      e.g. <div xmlns="http://www.w3.org/1999/xhtml">Made with love by</div>

    Example:

    const fs = require("fs");
    const xpath = require("xpath-html");
     
    const html = fs.readFileSync(`${__dirname}/shopback.html`, "utf8");
    const group = xpath.fromPageSource(html).findElement("//div[@class='ui-store-group']");
     
    const node = xpath.fromNode(group).findElement("//a[@href='/aliexpress']");
     
    console.log(node.toString());

    Result:

    <a class="store-logo-wrapper" href="/aliexpress" title="AliExpress Coupons &amp; Promo Codes" xmlns="http://www.w3.org/1999/xhtml"><img class="store-logo" src="https://cloud.shopback.com/t_sd_250_pad,f_auto,fl_lossy,q_auto/sg-store/49/49_logo_86958e96.png" alt="AliExpress Coupons &amp; Promo Codes"/></a>
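When the input is a hand-written XHTML string rather than a node from an earlier query, the namespace required by the note above is easy to forget. The sketch below adds it to the root element when it is missing. ensureXhtmlNamespace is a hypothetical helper, not part of xpath-html, and it makes simplifying assumptions: the string starts with the root element's opening tag, and no attribute value contains a ">" character.

```javascript
const XHTML_NS = "http://www.w3.org/1999/xhtml";

// Add the XHTML namespace to the root element of a fragment if it is missing.
// Assumes the string begins with the root element's opening tag.
function ensureXhtmlNamespace(xhtml) {
  const match = /^\s*<([A-Za-z][\w.-]*)((?:\s[^<>]*?)?)(\/?)>/.exec(xhtml);
  if (!match) {
    throw new Error("Expected the string to start with an opening tag");
  }
  const [openTag, tagName, attributes, selfClosing] = match;
  if (/\sxmlns\s*=/.test(attributes)) {
    return xhtml; // already namespaced, return the input untouched
  }
  const patched = `<${tagName} xmlns="${XHTML_NS}"${attributes}${selfClosing}>`;
  return patched + xhtml.slice(openTag.length);
}

console.log(ensureXhtmlNamespace("<div>Made with love by</div>"));
// <div xmlns="http://www.w3.org/1999/xhtml">Made with love by</div>
```

The returned string has the shape shown in the note and can then be passed to fromNode.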

    fromNode(xhtml).findElements(expression)

    Select multiple elements from an XHTML fragment.
    Same as fromPageSource(html).findElements(expression), but it queries only a part of an HTML page.

    Arguments:

    Name        Type            Description
    xhtml       Node or string  Either a node returned from a query or a well-formed XHTML string
    expression  string          The given XPath expression

    Returns: Array<Node>

    Notes:

    • The input xhtml must have a namespace of xmlns="http://www.w3.org/1999/xhtml"
      e.g. <div xmlns="http://www.w3.org/1999/xhtml">Made with love by</div>

    Example:

    const fs = require("fs");
    const xpath = require("xpath-html");
     
    const html = fs.readFileSync(`${__dirname}/shopback.html`, "utf8");
    const group = xpath.fromPageSource(html).findElement("//div[@class='ui-store-group']");
     
    const nodes = xpath.fromNode(group).findElements("//img[contains(@src,'shopily')]");
     
    console.log("Number of nodes found:", nodes.length);
    console.log("nodes[0]:", nodes[0].toString());
    console.log("nodes[1]:", nodes[1].toString());

    Result:

    Number of nodes found: 2
    nodes[0]: <img class="store-logo" src="https://shopily-sg.s3.amazonaws.com/uploads/stores/504/504_logo_200c4121.png" alt="zChocolat Coupons &amp; Promo Codes" xmlns="http://www.w3.org/1999/xhtml"/>
    nodes[1]: <img class="store-logo" src="https://shopily-sg.s3.amazonaws.com/uploads/stores/2498/2498_logo_81f0a24d.png" alt="Bed Bath &amp; Beyond Coupons &amp; Promo Codes" xmlns="http://www.w3.org/1999/xhtml"/>

    node.getTagName()

    Retrieve the node's tag name.

    Arguments: None

    Returns: string

    Example:

    const fs = require("fs");
    const xpath = require("xpath-html");
     
    const html = fs.readFileSync(`${__dirname}/shopback.html`, "utf8");
    const node = xpath.fromPageSource(html).findElement("//*[text()='Made with love by']");
     
    console.log("Single node's tag name:", node.getTagName());
     
    const nodes = xpath
      .fromPageSource(html)
      .findElements("//img[starts-with(@src, 'https://cloud.shopback.com')]");
     
    console.log("First nodes[0] tag name:", nodes[0].getTagName());
    console.log("Second nodes[1] tag name:", nodes[1].getTagName());

    Result:

    Single node's tag name: div
    First nodes[0] tag name: img
    Second nodes[1] tag name: img

    node.getText()

    Get the visible innerText of the node.

    Arguments: None

    Returns: string

    Example:

    const fs = require("fs");
    const xpath = require("xpath-html");
     
    const html = fs.readFileSync(`${__dirname}/shopback.html`, "utf8");
    const node = xpath.fromPageSource(html).findElement("//*[text()='Made with love by']");
     
    console.log("Text of the node:", node.getText());
     
    const nodes = xpath
      .fromPageSource(html)
      .findElements("//div[@id='home-page-container']//*[@class='title-text']");
     
    console.log("Text of nodes[0]:", nodes[0].getText());
    console.log("Text of nodes[1]:", nodes[1].getText());

    Result:

    Text of the node: Made with love by
    Text of nodes[0]: Up to 10.0% Cash Rewards
    Text of nodes[1]: Up to 7.0% Cashback
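Text returned by getText() usually needs a little post-processing before it is useful as data. As an illustration, the sketch below pulls the percentage out of strings like the ones in the result above. extractPercent is a hypothetical helper, and the regular expression assumes the number appears directly before a "%" sign.

```javascript
// Pull the first percentage out of a text such as "Up to 10.0% Cash Rewards".
// Returns a number, or null when the text contains no percentage.
function extractPercent(text) {
  const match = /(\d+(?:\.\d+)?)\s*%/.exec(text);
  return match ? Number(match[1]) : null;
}

// The two strings below are copied from the result above.
const texts = ["Up to 10.0% Cash Rewards", "Up to 7.0% Cashback"];
console.log(texts.map((text) => extractPercent(text))); // [ 10, 7 ]
```

With real nodes, the input would be nodes.map((node) => node.getText()).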

    node.getAttribute(name)

    Retrieve the current value of the given attribute of this node.

    Arguments:

    Name  Type    Description
    name  string  The name of the attribute to query

    Returns: string

    Example:

    const fs = require("fs");
    const xpath = require("xpath-html");
     
    const html = fs.readFileSync(`${__dirname}/shopback.html`, "utf8");
    const node = xpath.fromPageSource(html).findElement("//a[text()='View All Popular Stores']");
     
    console.log("The href value is:", node.getAttribute("href"));

    Result:

    The href value is: /all-stores
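The href in the result above is relative, so it cannot be requested on its own. A minimal sketch of resolving it with the standard URL class follows. toAbsoluteUrl is a hypothetical helper, and the default base is an assumption taken from the site the example page was downloaded from.

```javascript
// Resolve an href against the page it came from.
// The default base is an assumption: the example page was downloaded from
// https://www.shopback.sg.
function toAbsoluteUrl(href, base = "https://www.shopback.sg") {
  return new URL(href, base).toString();
}

console.log(toAbsoluteUrl("/all-stores")); // https://www.shopback.sg/all-stores
```

An href that is already absolute is returned as it is, because the URL constructor ignores the base in that case.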

    Dependencies

    Special thanks to all contributors of these libraries, which are the foundation xpath-html is built upon.

    1. xpath
    2. xmldom
    3. xmlserializer
    4. parse5

    License

    MIT

    Made with ❤ from ShopBack.

    Collaborators

    • hieu.van