A powerful Node.js web scraper that extracts clean, readable content from websites while keeping everything nicely organized. Perfect for creating AI training datasets! 🤖
- 🌐 Smart web crawling of internal links
- 🔄 Smart retry mechanism with proxy fallback
- 📖 Clean content extraction using Mozilla's Readability
- 🧹 Smart content processing and cleaning
- 🗂️ Maintains original URL structure in saved files
- 🚫 Excludes unwanted paths from scraping
- 🚦 Configurable rate limiting and delays
- 🤖 AI-friendly output formats (JSONL, CSV, clean text)
- 📊 Rich metadata extraction
- 🔗 Combine results from multiple scrapers into a unified dataset
- Node.js (v20 or higher)
- npm
- axios - HTTP requests master
- jsdom - DOM parsing wizard
- @mozilla/readability - Content extraction genius
npm i clean-web-scraper
# OR
git clone https://github.com/mlibre/Clean-Web-Scraper
cd Clean-Web-Scraper
sudo pacman -S extra/xorg-server-xvfb chromium
npm install
# Skip chromium download during npm installation
# npm install --ignore-scripts
const WebScraper = require('clean-web-scraper');
const scraper = new WebScraper({
baseURL: 'https://example.com/news', // Required: The website base URL to scrape
startURL: 'https://example.com/blog', // Optional: Custom starting URL
excludeList: ['/admin', '/private'], // Optional: Paths to exclude
exactExcludeList: ['/specific-page', // Optional: Exact URLs to exclude
/^https:\/\/host\.com\/\d{4}\/$/], // Optional: Regex patterns to exclude. This will exclude URLs like https://host.com/2023/ (see the quick check below)
scrapResultPath: './example.com/website', // Required: Where to save the content
jsonlOutputPath: './example.com/train.jsonl', // Optional: Custom JSONL output path
textOutputPath: "./example.com/texts", // Optional: Custom text output path
csvOutputPath: "./example.com/train.csv", // Optional: Custom CSV output path
strictBaseURL: true, // Optional: Only scrape URLs from same domain
maxDepth: Infinity, // Optional: Maximum crawling depth
maxArticles: Infinity, // Optional: Maximum articles to scrape
crawlingDelay: 1000, // Optional: Delay between requests (ms)
batchSize: 5, // Optional: Number of URLs to process concurrently
minContentLength: 400, // Optional: Minimum content length to consider valid
// Network options
axiosHeaders: {}, // Optional: Custom HTTP headers
axiosProxy: { // Optional: HTTP/HTTPS proxy
host: "localhost",
port: 2080,
protocol: "http"
},
axiosMaxRetries: 5, // Optional: Max retry attempts
axiosRetryDelay: 40000, // Optional: Delay between retries (ms)
useProxyAsFallback: false, // Optional: Fallback to proxy on failure
// Puppeteer options for handling dynamic content
usePuppeteer: false, // Optional: Enable Puppeteer browser
});
await scraper.start();
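The regex entries in exactExcludeList are plain JavaScript RegExp objects, so a pattern can be sanity-checked before starting a crawl. A minimal standalone Node.js sketch (not part of the library) using the pattern from the example above:

```js
// Quick check of the year-archive exclusion pattern used above.
const yearArchive = /^https:\/\/host\.com\/\d{4}\/$/;

console.log(yearArchive.test("https://host.com/2023/"));      // true  -> excluded
console.log(yearArchive.test("https://host.com/2023/post/")); // false -> still crawled
```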
const WebScraper = require('clean-web-scraper');
// Scrape documentation website
const docsScraper = new WebScraper({
baseURL: 'https://docs.example.com',
scrapResultPath: './datasets/docs',
maxDepth: 3, // Optional: Maximum depth for recursive crawling
includeMetadata: true, // Optional: Include metadata in output files
metadataFields: ["author", "articleTitle", "pageTitle", "description", "dataScrapedDate", "url"],
// Optional: Specify metadata fields to include
});
// Scrape blog website
const blogScraper = new WebScraper({
baseURL: 'https://blog.example.com',
scrapResultPath: './datasets/blog',
maxDepth: 3, // Optional: Maximum depth for recursive crawling
includeMetadata: true, // Optional: Include metadata in output files
metadataFields: ["author", "articleTitle", "pageTitle", "description", "dataScrapedDate"],
// Optional: Specify metadata fields to include
});
// Start scraping both sites
await docsScraper.start();
await blogScraper.start();
// Combine all scraped content into a single dataset
await WebScraper.combineResults('./combined', [docsScraper, blogScraper]);
node example-usage.js
Your AI-ready content is saved in a clean, structured format:
- 📂 Base folder: ./folderPath/example.com/
- 📄 Files preserve original URL paths
- 🤖 No HTML, no noise - just clean, structured text (.txt files)
- 📊 JSONL and CSV outputs, ready for AI consumption, model training and fine-tuning
example.com/
├── website/
│   ├── page1.txt                  # Clean text content
│   ├── page1.json                 # Full metadata
│   ├── page1.html                 # Original HTML content
│   └── blog/
│       ├── post1.txt
│       ├── post1.json
│       └── post1.html
├── texts/                         # Numbered text files
│   ├── 1.txt
│   └── 2.txt
├── texts_with_metadata/           # When includeMetadata is true
│   ├── 1.txt
│   └── 2.txt
├── train.jsonl                    # Combined content
├── train_with_metadata.jsonl      # When includeMetadata is true
├── train.csv                      # Clean text in CSV format
└── train_with_metadata.csv        # When includeMetadata is true
combined/
├── texts/                         # Combined numbered text files
│   ├── 1.txt
│   ├── 2.txt
│   └── n.txt
├── texts_with_metadata/           # Combined metadata text files
│   ├── 1.txt
│   ├── 2.txt
│   └── n.txt
├── combined.jsonl                 # Combined JSONL content
├── combined_with_metadata.jsonl
├── combined.csv                   # Combined CSV content
└── combined_with_metadata.csv
The actual article content starts here. This is the clean, processed text of the article that was extracted from the webpage.
articleTitle: Palestine history
description: This is a great article about Palestine history
author: Rawan
language: en
dateScraped: 2024-01-20T10:30:00Z
url: https://palianswers.com
---
The actual article content starts here. This is the clean, processed text of the article that was extracted from the webpage.
{"text": "Clean article content here"}
{"text": "Another article content here"}
{"text": "Article content", "metadata": {"articleTitle": "Page Title", "author": "John Doe"}}
{"text": "Another article", "metadata": {"articleTitle": "Second Page", "author": "Jane Smith"}}
{
"url": "https://example.com/page",
"pageTitle": "Page Title",
"description": "Page description",
"language": "en",
"canonicalUrl": "https://example.com/canonical",
"ogTitle": "Open Graph Title",
"ogDescription": "Open Graph Description",
"ogImage": "https://example.com/image.jpg",
"ogType": "article",
"dataScrapedDate": "2024-01-20T10:30:00Z",
"originalHtml": "<html>...</html>",
"articleTitle": "Article Title",
}
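Each scraped page is saved with its clean text, its metadata, and the original HTML side by side (page1.txt, page1.json, page1.html in the layout above). A minimal sketch that pairs one page's text with its metadata; page1 is just the illustrative file name from that layout:

```js
// Minimal sketch: read one scraped page's clean text together with its metadata JSON.
const fs = require("fs");

const text = fs.readFileSync("./example.com/website/page1.txt", "utf8");
const meta = JSON.parse(fs.readFileSync("./example.com/website/page1.json", "utf8"));

console.log(meta.pageTitle, "-", meta.url);
console.log(text.slice(0, 200)); // first 200 characters of the clean article text
```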
text
"Clean article content here"
"Another article content here"
text,articleTitle,author,description
"Article content","Page Title","John Doe","Page description"
"Another article","Second Page","Jane Smith","Another description"
This project supports Palestinian rights and stands in solidarity with Palestine. We believe in the importance of documenting and preserving Palestinian narratives, history, and struggles for justice and liberation.
Free Palestine 🇵🇸