
Scraped

A type-safe API to scrape web content

Installation

Install using your favorite npm package manager:

pnpm add scraped

Usage

Basic usage can follow either a lazy or an immediate pattern:

  1. Lazily configure a page you want to scrape with page:

    Imagine we want to scrape a site such as https://example.com on some interval or in response to a user event, but we want to configure the information we're interested in up front. We could do that with code something like this:

    import { page } from "scraped";
    const example = page("example", {
        title: { first: "head title" },
        image: { first: `meta[property="og:image"]` },
        hrefs: { all: `a` }
    });

    This turns example into a strongly typed configuration for the scraping you'd like to perform, but no scraping actually happens until the provided scrape method is called (see the reuse sketch after this list):

    const results = await example.scrape("https://example.com");
  2. Configure and scrape as one step:

    Sometimes the partial application / lazy loading approach isn't of particular value; in those cases you can jump straight to the action:

    import { scrape } from "scraped";
    const results = await scrape("https://example.com", {
        title: { first: "head title" },
        image: { first: `meta[property="og:image"]` },
        hrefs: { all: `a` }
    });
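Because a lazy template is just a value, it can be reused across many URLs or run on a schedule. A short sketch (the second URL and the interval are illustrative, not from the library):

import { page } from "scraped";

const example = page("example", {
    title: { first: "head title" }
});

// reuse one template across several URLs
const results = await Promise.all([
    example.scrape("https://example.com"),
    example.scrape("https://example.org")
]);

// or run it on an interval, per the scenario described in step 1
setInterval(async () => {
    const latest = await example.scrape("https://example.com");
    console.log(latest.title);
}, 60_000);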

The Scraping API

Building Blocks

In the Usage section we showed two query types without explicitly calling them out, so let's start there. As you saw, you create a map of key-value pairs describing what you'd like to extract from the page:

  • the keys in the key-value pair are the names of the things you'd like to refer to
  • the values are the "query" you're performing
  • in the Usage section we saw the following query types (a short sketch follows this list):
    • QueryFirst - identified by the { first: selector } property; this will return either a DOM Element or null depending on whether the selector can be found.
    • QueryAll - identified by the { all: selector } property; this will always return an array of DOM Elements (though if the selector is not found the array will be empty)
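Putting those together, a minimal sketch (result types noted in comments, per the descriptions above):

import { scrape } from "scraped";

const result = await scrape("https://example.com", {
    title: { first: "head title" }, // QueryFirst -> IElement | null
    hrefs: { all: "a" }             // QueryAll   -> IElement[]
});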

These two query types are probably all you really need, since once you have the DOM element you can drill down further into the details you're interested in. That said, it's often nice to get back the precise attribute, text, HTML, etc. that you're looking for as a strongly typed return value. In the next section we'll cover how to achieve this.

el-Based Refinement

The two building-block queries we covered in the prior section both hand you a DOM element (specifically, typed using Happy DOM's IElement structure). Two refinements can get you a lot further in extracting exactly what you want: RefinedQueryFirst and RefinedQueryAll. Let's use the same example code, but more refined (imagine dijon mustard versus yellow mustard):

const result = await scrape("https://example.com", {
    title: { first: "head title", refine: el => el?.textContent },
    // the og:image URL lives in the meta tag's "content" attribute
    image: { first: `meta[property="og:image"]`, refine: el => el?.getAttribute("content") ?? undefined },
    hrefs: { all: `a`, refine: el => el.getAttribute("href") }
});

With this simple addition to our querying skills we get back useful final content from the scrape. The type of result will be:

interface Result {
    _from: "__immediate__"; // or the name of your template if you'd used a lazy-loaded page() template
    _url: "https://example.com";
    title: string | undefined;
    image: string | undefined;
    hrefs: string[]
}

In this example the job was simply converting an IElement into a scalar value, but you can pass in any function you like; so long as the function is strongly typed, its return type flows through as the type of the corresponding result property.
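For instance, a refine function can return a structured value rather than a scalar; a sketch (the returned shape is our own, not something the library defines):

import { scrape } from "scraped";

const result = await scrape("https://example.com", {
    links: {
        all: "a",
        // whatever shape this function returns becomes the element
        // type of the `links` array in the result
        refine: el => ({
            href: el.getAttribute("href"),
            text: el.textContent
        })
    }
});
// result.links: { href: string | null; text: string }[]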

some Filtering

We've seen that QueryAll provides all elements which meet your selector criterion. QuerySome and RefinedQuerySome are mild variants of QueryAll and RefinedQueryAll that let you filter the matches with a where predicate. Let's take our prior "all" example and change it to a "some" example:

const result = await scrape("https://example.com", {
    hrefs: { 
        some: `a`, 
        where: el => el.className.split(" ").includes("foobar"),
        refine: el => el.getAttribute("href") 
    }
});

Now we get the hrefs of only those links on the page which carry the class "foobar".
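The result remains strongly typed; mirroring the earlier Result shape, it would look something like this:

interface Result {
    _from: "__immediate__";
    _url: "https://example.com";
    hrefs: string[]; // only the "foobar" links survive the where filter
}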

The select() Helper

Internally this library leverages the @yankeeinlondon/happy-wrapper package, which provides useful utilities for working with Happy DOM. One of the really nice utilities it implements -- and which this library proxies out to its users -- is the select() utility.

The select(el) utility consumes an element and then provides a useful API surface which includes:

  • findAll(sel): IElement[]
  • findFirst(sel): IElement | null
  • mapAll(sel)<O>(fn: (el: IElement) => O)

The entire API surface is typed and has useful doc comments to help guide you in its use.
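For instance, select() pairs nicely with a refine function when you need to drill into a container element. A sketch (this assumes select() is re-exported from this package, as described above):

import { scrape, select } from "scraped";

const result = await scrape("https://example.com", {
    navLinks: {
        first: "nav",
        // drill into the <nav> element and map its links to hrefs
        refine: el =>
            el ? select(el).findAll("a").map(a => a.getAttribute("href")) : []
    }
});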

Scrape Composition and Provided Patterns

When you define a scraping template with page() you get a high-level form of reuse, but there are opportunities to gain reuse at the individual key/value pairings of a definition too.

As an example, pages commonly contain links you want to scrape off. Let's say you're interested in just a list of all the link URLs on the page:

const template = page("example", {
    links: { all: "a", refine: el => el.getAttribute("href") }
});

This works and isn't all that much text to drop into a future page that needs the same thing. But what if you not only wanted the links, but also the classes on each; what if you wanted to distinguish links within the site from external ones; and what if you only wanted to match links inside one part of the page rather than the page at large? Now a reuse pattern for links sounds like a better idea, and you can create one for yourself very easily.

Reuse in this case is best attained at the QuerySelector level, which in our example above is the value of the links key. Since every key/value pair in a page template's definition has a QuerySelector as its value, we can build a higher-level function to do our bidding.

So long as our helper function returns a QuerySelector like we see below:

const links = (selector: string): QuerySelector => ({ /* ... */ });

we can replace our link scraping with something like this:

const template = page("example", {
    links: links(".main-content")
});
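A minimal sketch of such a helper, assuming the QuerySelector shape matches the refined queries shown earlier (the type import is illustrative; the exact export may differ):

import type { QuerySelector } from "scraped";

// scope the query to a container selector and refine down to href strings
const links = (container: string): QuerySelector => ({
    all: `${container} a`,
    refine: el => el.getAttribute("href")
});

Prefixing the container onto the selector string is one simple way to limit matches to a single part of the page.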

This simple example shows the pattern; this library exports more powerful versions of the links helper along with several others:

  • links(options) - find links in all or part of the page; optionally provide a filter function to eliminate some based on link attributes
  • images(options) - find images in all or part of the page, distinguishing between "self-hosted" and external
  • h1(opt), h2(opt), h3(opt), h4(opt) - extract heading-level text and heading attributes
  • meta(options) - get valuable metadata that often exists on a page, including:
    • title - the text of the page's <title> tag
    • description - a description of the page, if found in a <meta> tag
    • image - an image to represent the page, if found in a <meta> tag

All provided helpers are strongly typed with good comments to help you use them in a self-documenting manner. Also, if you want to create your own abstractions, have a look at the source for these on GitHub to get a good starting point.
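For instance, a template composing a couple of the provided helpers might look like this (the option shapes are illustrative assumptions; consult the typed helpers for the real signatures):

import { page, links, meta } from "scraped";

const article = page("article", {
    links: links({ /* options */ }),
    meta: meta({ /* options */ })
});

const results = await article.scrape("https://example.com");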
