scrapefrom

2.5.1 • Public • Published

Scrape data from any webpage.

installation

$ npm i scrapefrom

import

const scrapefrom = require("scrapefrom");
// or,
// import scrapefrom from "scrapefrom"

use cases

Pass just a URL to extract html, htmlSplits, htmlStripped, and htmlStrippedSplits.

scrapefrom("https://www.npmjs.com/package/scrapefrom").then(console.log);

Extract an array of strings for all h1 tags on a page.

scrapefrom({
  url: "https://www.npmjs.com/package/scrapefrom",
  extract: "h1",
  defaultDelimiter: null,
}).then(console.log); // "{ h1: [...] }"

Extract an array of strings for all h1 tags on a page as "titles".

scrapefrom({
  url: "https://www.npmjs.com/package/scrapefrom",
  extract: { name: "titles", selector: "h1", delimiter: null },
}).then(console.log); // "{ titles: [...] }"

Extract a joined array of strings for all h1 tags on a page using a delimiter, as "title".

scrapefrom({
  url: "https://www.npmjs.com/package/scrapefrom",
  extract: { name: "title", selector: "h1", delimiter: "," },
}).then(console.log); // "{ title: "...,..." }"

Extract an array of datetime attribute values for all time tags on a page as "dates".

scrapefrom({
  url: "https://www.npmjs.com/package/scrapefrom",
  extract: {
    name: "dates",
    selector: "time",
    attribute: "datetime",
    delimiter: null,
  },
}).then(console.log); // "{ dates: [...] }"

Extract previous use cases in a single config.

scrapefrom({
  url: "https://www.npmjs.com/package/scrapefrom",
  defaultDelimiter: null,
  extracts: [
    { name: "titles", selector: "h1" },
    { name: "dates", selector: "time", attribute: "datetime" },
  ],
}).then(console.log); // "{ titles: [...], dates: [...] }"

Extract previous use cases from multiple URLs.

scrapefrom([
  {
    url: "https://www.npmjs.com/package/scrapefrom",
    defaultDelimiter: null,
    extracts: [
      { name: "titles", selector: "h1" },
      { name: "dates", selector: "time", attribute: "datetime" },
    ],
  },
  {
    url: "https://www.npmjs.com/package/async-fetch",
    defaultDelimiter: null,
    extracts: [
      { name: "titles", selector: "h1" },
      { name: "dates", selector: "time", attribute: "datetime" },
    ],
  },
]).then(console.log); // "[{ titles: [...], dates: [...] }, { titles: [...], dates: [...] }]"

Extract a list of items from a page.

scrapefrom({
  url: "https://www.npmjs.com/package/async-fetch",
  extract: {
    selector: "tbody tr",
    name: "rows",
    extracts: [
      { selector: "td:nth-child(1)", name: "key" },
      { selector: "td:nth-child(2)", name: "type" },
      { selector: "td:nth-child(3)", name: "definition" },
      { selector: "td:nth-child(4)", name: "default" },
    ],
  },
}).then(console.log); // "{ rows: [{ key: "...", type: "...", definition: "...", default: "..." }, ...] }"

if a page requires javascript...

By default, scrapefrom uses fetch under the hood. If a page is unavailable because it requires JavaScript, there is the option to use puppeteer instead, which can bypass this requirement by rendering the page in a headless Chrome browser.
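As a sketch only, a config opting into puppeteer might look like the following — note that the `puppeteer` flag name here is an assumption, not a documented option; consult the package's API for the actual switch:

```javascript
// Hypothetical config: the `puppeteer` flag name is an assumption, not a
// documented option — check the package's API for the real option name.
const config = {
  url: "https://www.npmjs.com/package/scrapefrom",
  puppeteer: true, // assumed flag: render with headless Chrome instead of fetch
  extract: { name: "titles", selector: "h1", delimiter: null },
};

// scrapefrom(config).then(console.log);
```

The rest of the config (url, extract, delimiter) follows the same shape as the examples above; only the fetching backend would change.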
