scrapegpt

1.0.2 • Public • Published

Scrape GPT

Use GPT-4 to scrape a remote web page given its URL. scrapegpt is a simple wrapper around the OpenAI API.

This codebase was generated by GPT, based on the Python scrapeghost codebase.

Installation

npm i scrapegpt

Usage

Use at your own risk. This library makes expensive API calls (roughly $0.36 for a GPT-4 call on a moderately sized page). Cost estimates are based on the OpenAI pricing page and are not guaranteed to be accurate.

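const scrape = require('scrapegpt');

// Page to scrape
const url = 'https://www.bbc.com/news/world-us-canada-57982050';

scrape(url)
  .then((result) => {
    console.log(result);
  })
  .catch((err) => {
    // Log the failure if the promise rejects
    console.error('Scrape failed:', err);
  });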

Features

The purpose of this library is to provide a convenient interface for exploring web scraping with GPT.

While the bulk of the work is done by the GPT model, scrapegpt provides a number of features to make it easier to use.

JSON schema definition - Define the shape of the data you want to extract as any JSON object, with as much or as little detail as you want.
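As a rough illustration of the schema idea, the sketch below describes the desired output as a plain JSON object and embeds it in a prompt. The names `articleSchema` and `buildPrompt` are hypothetical and are not part of scrapegpt's documented API; the prompt wording is also an assumption.

```javascript
// Hypothetical schema: keys are the fields to extract,
// values describe the expected type of each field.
const articleSchema = {
  title: 'string',
  author: 'string',
  published: 'ISO 8601 date string',
  tags: ['string'],
};

// Hypothetical helper: combine the schema and the page HTML into one prompt string.
function buildPrompt(schema, html) {
  return [
    'Extract data from the HTML below.',
    'Respond with JSON only, matching this shape:',
    JSON.stringify(schema, null, 2),
    'HTML:',
    html,
  ].join('\n');
}

console.log(buildPrompt(articleSchema, '<h1>Hello</h1>'));
```

A looser schema (fewer keys, vaguer type hints) leaves more to the model, while a detailed one constrains the output more tightly.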

Preprocessing

  • HTML cleaning - Remove unnecessary HTML to reduce the size and cost of API requests.
  • Auto-splitting - Optionally split the HTML into multiple calls to the model, allowing for larger pages to be scraped.
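The two preprocessing steps above might look something like the following. This is a minimal sketch, not scrapegpt's actual implementation: `cleanHtml` and `splitHtml` are hypothetical names, the cleaning is regex-based rather than parser-based, and chunks are measured in characters as a stand-in for model tokens.

```javascript
// Strip content that rarely carries data: scripts, styles, comments, extra whitespace.
// Regex-based cleaning is a simplification; a real HTML parser is more robust.
function cleanHtml(html) {
  return html
    .replace(/<script\b[\s\S]*?<\/script>/gi, '')
    .replace(/<style\b[\s\S]*?<\/style>/gi, '')
    .replace(/<!--[\s\S]*?-->/g, '')
    .replace(/\s+/g, ' ')
    .trim();
}

// Split text into chunks of at most maxChars characters (maxChars must be >= 1),
// preferring to break right after a '>' so tags are not cut in half.
function splitHtml(html, maxChars) {
  const chunks = [];
  let start = 0;
  while (start < html.length) {
    let end = Math.min(start + maxChars, html.length);
    if (end < html.length) {
      const lastClose = html.lastIndexOf('>', end - 1);
      if (lastClose > start) end = lastClose + 1;
    }
    chunks.push(html.slice(start, end));
    start = end;
  }
  return chunks;
}

const cleaned = cleanHtml('<p>Hi</p> <script>track()</script> <p>there</p>');
console.log(splitHtml(cleaned, 12));
```

Each chunk would then be sent to the model as a separate request, and the partial results merged afterwards.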

Postprocessing

  • JSON validation - Ensure that the response is valid JSON. (With the option to kick it back to GPT for fixes if it's not.)
  • Hallucination check - Verify that the data in the response actually appears on the page.
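One way to picture the two postprocessing checks is sketched below. The function names are hypothetical, and the hallucination check shown here is deliberately naive (a case-insensitive substring match on string values); the library's own checks may work differently.

```javascript
// Parse the model's reply; return null instead of throwing on invalid JSON.
function parseJsonOrNull(text) {
  try {
    return JSON.parse(text);
  } catch (err) {
    return null;
  }
}

// Collect every string leaf in a nested JSON value (objects and arrays).
function stringLeaves(value, out = []) {
  if (typeof value === 'string') {
    out.push(value);
  } else if (value && typeof value === 'object') {
    for (const child of Object.values(value)) stringLeaves(child, out);
  }
  return out;
}

// Naive hallucination check: return the extracted strings that do not
// appear verbatim (ignoring case) anywhere in the page text.
function findUnsupportedValues(data, pageText) {
  const haystack = pageText.toLowerCase();
  return stringLeaves(data).filter((s) => !haystack.includes(s.toLowerCase()));
}

const reply = parseJsonOrNull('{"title": "Hello", "tags": ["World", "Mars"]}');
console.log(findUnsupportedValues(reply, 'Hello world'));
```

When `parseJsonOrNull` returns null, the caller could send the raw reply back to the model with a request to repair it, which is the "kick it back to GPT" option described above.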

Cost Controls

  • Scrapers keep running totals of how many tokens have been sent and received, so costs can be tracked.
  • Support for automatic fallbacks (e.g. use cost-saving GPT-3.5-Turbo by default, fall back to GPT-4 if needed.)
  • Supports setting a budget, and stops the scraper if that budget is exceeded.
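The running totals and the budget stop could be sketched as follows. `CostTracker` and `ASSUMED_PRICES` are hypothetical names, and the per-token prices are assumptions for illustration only; check the OpenAI pricing page for current values.

```javascript
// Assumed prices in USD per 1,000 tokens (illustrative, not authoritative).
const ASSUMED_PRICES = {
  'gpt-3.5-turbo': { prompt: 0.002, completion: 0.002 },
  'gpt-4': { prompt: 0.03, completion: 0.06 },
};

class CostTracker {
  constructor(budgetUsd) {
    this.budgetUsd = budgetUsd;
    this.promptTokens = 0;
    this.completionTokens = 0;
    this.totalUsd = 0;
  }

  // Record one API call; throw once the running total exceeds the budget.
  record(model, promptTokens, completionTokens) {
    const price = ASSUMED_PRICES[model];
    if (!price) throw new Error(`No assumed price for model: ${model}`);
    this.promptTokens += promptTokens;
    this.completionTokens += completionTokens;
    this.totalUsd +=
      (promptTokens * price.prompt + completionTokens * price.completion) / 1000;
    if (this.totalUsd > this.budgetUsd) {
      throw new Error(
        `Budget exceeded: $${this.totalUsd.toFixed(2)} > $${this.budgetUsd.toFixed(2)}`
      );
    }
  }
}

const tracker = new CostTracker(1.0);
tracker.record('gpt-4', 10000, 1000);
console.log(tracker.totalUsd.toFixed(2)); // "0.36" under the assumed prices
```

Under these assumed prices, a single GPT-4 call with 10,000 prompt tokens and 1,000 completion tokens comes to $0.36, so a $1.00 budget would be exceeded on the third such call.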

Resources

Inspired by: https://github.com/jamesturk/scrapeghost

Package details

  • Version: 1.0.2
  • License: MIT
  • Unpacked size: 21.7 kB
  • Total files: 9
  • Weekly downloads: 6
  • Collaborators: koolamusic