HTML Content Processor

A modern TypeScript library for cleaning, filtering, and converting HTML content to Markdown with intelligent content extraction. Supports cross-environment execution (Browser/Node.js) with automatic page type detection.

Features

🚀 Modern API Design - Clean functional and class-based APIs
🧠 Intelligent Filtering - Automatic page type detection with optimal filtering strategies
📝 High-Quality Markdown Conversion - Advanced HTML to Markdown transformation
🌐 Cross-Environment Support - Full compatibility with both browser and Node.js environments
🎯 Smart Presets - Optimized configurations for different content types
🔌 Plugin System - Extensible plugin architecture
📊 Automatic Detection - Smart detection of search engines, blogs, news, documentation, and more

Installation

npm install html-content-processor

Quick Start

Basic Usage

import { htmlToMarkdown, htmlToText, cleanHtml } from 'html-content-processor';

// Convert HTML to Markdown
const markdown = await htmlToMarkdown('<h1>Hello</h1><p>World</p>');

// Convert HTML to plain text
const text = await htmlToText('<h1>Hello</h1><p>World</p>');

// Clean HTML content
const clean = await cleanHtml('<div>Content</div><script>ads</script>');

Automatic Page Type Detection (Recommended)

The library can automatically detect page types and apply optimal filtering strategies:

import { htmlToMarkdownAuto, cleanHtmlAuto, extractContentAuto } from 'html-content-processor';

// Automatic detection with URL context
const markdown = await htmlToMarkdownAuto(html, 'https://example.com/blog-post');

// Clean HTML with automatic page type detection
const cleaned = await cleanHtmlAuto(html, 'https://news.example.com/article');

// Extract content with detailed page type information
const result = await extractContentAuto(html, 'https://docs.example.com/guide');
console.log('Detected page type:', result.pageType.type);
console.log('Confidence:', result.pageType.confidence);
console.log('Markdown:', result.markdown.content);

HtmlProcessor Class (Advanced Usage)

import { HtmlProcessor } from 'html-content-processor';

// Method chaining; filter() is asynchronous, so await it before converting
const autoProcessor = await HtmlProcessor
  .from(html)
  .withBaseUrl('https://example.com')
  .withAutoDetection() // Enable automatic page type detection
  .filter();

const result = await autoProcessor.toMarkdown();

// Manual page type setting
const processor = await HtmlProcessor
  .from(html)
  .withPageType('blog') // Manually set page type
  .filter();

const markdown = await processor.toMarkdown();

Content-Specific Presets

import { 
  htmlToArticleMarkdown, 
  htmlToBlogMarkdown, 
  htmlToNewsMarkdown 
} from 'html-content-processor';

// Optimized for different content types
const articleMd = await htmlToArticleMarkdown(html, baseUrl);
const blogMd = await htmlToBlogMarkdown(html, baseUrl);
const newsMd = await htmlToNewsMarkdown(html, baseUrl);

API Reference

Core Functions

Function	Description	Return Type
`htmlToMarkdown(html, options?)`	Convert HTML to Markdown	`Promise<string>`
`htmlToMarkdownWithCitations(html, baseUrl?, options?)`	Convert HTML to Markdown with citations	`Promise<string>`
`htmlToText(html, options?)`	Convert HTML to plain text	`Promise<string>`
`cleanHtml(html, options?)`	Clean HTML content	`Promise<string>`
`extractContent(html, options?)`	Extract content fragments	`Promise<string[]>`

Automatic Detection Functions

Function	Description	Benefits
`htmlToMarkdownAuto(html, url?, options?)`	Auto-detect page type and convert to Markdown	Optimal filtering for each page type
`cleanHtmlAuto(html, url?, options?)`	Auto-detect page type and clean HTML	Smart noise removal
`extractContentAuto(html, url?, options?)`	Auto-detect and extract with detailed results	Comprehensive page analysis

Example: Using Auto-Detection

// Blog post detection
const blogResult = await htmlToMarkdownAuto(html, 'https://medium.com/@user/post');
// Automatically applies blog-optimized filtering

// News article detection  
const newsResult = await htmlToMarkdownAuto(html, 'https://cnn.com/article');
// Automatically applies news-optimized filtering

// Documentation detection
const docsResult = await htmlToMarkdownAuto(html, 'https://docs.react.dev/guide');
// Automatically applies documentation-optimized filtering

// Search engine results detection
const searchResult = await htmlToMarkdownAuto(html, 'https://google.com/search?q=query');
// Automatically applies search-results-optimized filtering

Content-Specific Presets

Function	Optimized For
`htmlToArticleMarkdown()`	Long-form articles
`htmlToBlogMarkdown()`	Blog posts
`htmlToNewsMarkdown()`	News articles
`strictCleanHtml()`	Aggressive cleaning
`gentleCleanHtml()`	Conservative cleaning

HtmlProcessor Class

// Create processor
const processor = HtmlProcessor.from(html, options);

// Configuration methods
processor.withBaseUrl(url)           // Set base URL
processor.withOptions(options)       // Update options
processor.withAutoDetection(url?)    // Enable auto-detection
processor.withPageType(type)         // Manually set page type

// Processing methods
await processor.filter(options?)     // Apply filtering
await processor.toMarkdown(options?) // Convert to Markdown
await processor.toText()             // Convert to plain text
await processor.toArray()            // Convert to fragment array
processor.toString()                 // Get cleaned HTML

// Information methods
processor.getOptions()               // Get current options
processor.isProcessed()              // Check if processed
processor.getPageTypeResult()        // Get page type detection result

Configuration Options

Filter Options (FilterOptions)

{
  threshold?: number;           // Filtering threshold (default: 2)
  strategy?: 'fixed' | 'dynamic'; // Filtering strategy (default: 'dynamic')
  ratio?: number;              // Text density ratio (default: 0.48)
  minWords?: number;           // Minimum word count (default: 0)
  preserveStructure?: boolean; // Preserve structure (default: false)
  keepElements?: string[];     // Elements to keep
  removeElements?: string[];   // Elements to remove
}
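
To make the documented defaults concrete, the sketch below merges a partial set of filter options over them. The `FilterOptions` interface is re-declared locally from the listing above, and `DEFAULT_FILTER_OPTIONS` and `resolveFilterOptions` are hypothetical names used only for illustration; the library may resolve defaults differently internally.

```typescript
// Local copy of the documented shape; not imported from the library.
interface FilterOptions {
  threshold?: number;
  strategy?: 'fixed' | 'dynamic';
  ratio?: number;
  minWords?: number;
  preserveStructure?: boolean;
  keepElements?: string[];
  removeElements?: string[];
}

// Defaults exactly as stated in the option comments above.
const DEFAULT_FILTER_OPTIONS: FilterOptions = {
  threshold: 2,
  strategy: 'dynamic',
  ratio: 0.48,
  minWords: 0,
  preserveStructure: false,
};

// Hypothetical helper: caller-supplied keys replace the defaults.
function resolveFilterOptions(overrides: FilterOptions = {}): FilterOptions {
  return { ...DEFAULT_FILTER_OPTIONS, ...overrides };
}

// Stricter settings for noisy pages: require longer blocks and drop page chrome.
const strictOptions = resolveFilterOptions({
  minWords: 20,
  removeElements: ['nav', 'footer'],
});
```

An object built this way could be passed wherever the API accepts filter options, for example `processor.filter(strictOptions)`.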

Convert Options (ConvertOptions)

{
  citations?: boolean;         // Generate citations (default: true)
  ignoreLinks?: boolean;       // Ignore links (default: false)
  ignoreImages?: boolean;      // Ignore images (default: false)
  baseUrl?: string;           // Base URL
  threshold?: number;         // Filter threshold
  strategy?: 'fixed' | 'dynamic'; // Filter strategy
  ratio?: number;             // Text density ratio
}

Automatic Page Type Detection

The library automatically detects and optimizes for these page types:

search-engine - Search engine result pages
blog - Blog posts and personal articles
news - News articles and journalism
documentation - Technical documentation
e-commerce - E-commerce and product pages
social-media - Social media content
forum - Forum discussions and Q&A
article - General articles and content
landing-page - Marketing and landing pages
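
When a page type comes from user input or configuration, it can help to validate it against the list above before handing it to `withPageType`. The sketch below is a local, library-independent helper: `PAGE_TYPES`, `PageType`, and `isPageType` are illustrative names, and the library may already export an equivalent type of its own.

```typescript
// The page types listed above, kept as a readonly tuple so a union type can be derived.
const PAGE_TYPES = [
  'search-engine',
  'blog',
  'news',
  'documentation',
  'e-commerce',
  'social-media',
  'forum',
  'article',
  'landing-page',
] as const;

type PageType = (typeof PAGE_TYPES)[number];

// Type guard: narrows an arbitrary string to one of the known page types.
function isPageType(value: string): value is PageType {
  return (PAGE_TYPES as readonly string[]).indexOf(value) !== -1;
}

const requested = 'documentation';
if (isPageType(requested)) {
  // Inside this branch, `requested` is typed as PageType.
  console.log('Known page type:', requested);
}
```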

How Auto-Detection Works

import { extractContentAuto } from 'html-content-processor';

const result = await extractContentAuto(html, url);

console.log('Page Type:', result.pageType.type);
console.log('Confidence:', (result.pageType.confidence * 100).toFixed(1) + '%');
console.log('Detection Reasons:', result.pageType.reasons);
console.log('Applied Filter Options:', result.pageType.filterOptions);
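
One way to act on the detection result is to trust the detected type only above a confidence cutoff and otherwise fall back to a generic type. The sketch below does not call the library: `DetectionResult` mirrors only the fields logged above, and `pickPageType`, the `0.6` cutoff, and the `'article'` fallback are illustrative assumptions rather than library behavior.

```typescript
// Mirrors the fields read from result.pageType above; declared locally.
interface DetectionResult {
  type: string;
  confidence: number; // assumed to be in the range 0..1, as the percentage formatting suggests
  reasons: string[];
}

// Hypothetical helper: keep the detected type only when confidence is high enough.
function pickPageType(
  result: DetectionResult,
  minConfidence = 0.6,
  fallback = 'article',
): string {
  return result.confidence >= minConfidence ? result.type : fallback;
}

// Hand-written sample values, not real detector output.
const confident: DetectionResult = { type: 'news', confidence: 0.92, reasons: ['sample reason'] };
const unsure: DetectionResult = { type: 'forum', confidence: 0.31, reasons: [] };

console.log(pickPageType(confident)); // news
console.log(pickPageType(unsure));    // article
```

The chosen type could then be applied manually with `withPageType` instead of relying on auto-detection.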

Environment Support

Node.js

npm install jsdom  # Recommended for best performance

Browser

Direct support, no additional dependencies required.

CDN

<script src="https://unpkg.com/html-content-processor"></script>
<script>
  // Global variable: window.htmlFilter
  htmlFilter.htmlToMarkdown(html).then(console.log);
  
  // Auto-detection example
  htmlFilter.htmlToMarkdownAuto(html, window.location.href).then(result => {
    console.log('Auto-detected content:', result);
  });
</script>

Real-World Examples

Web Scraping with Auto-Detection

import { htmlToMarkdownAuto } from 'html-content-processor';

// Scrape and convert blog post
const response = await fetch('https://blog.example.com/post-123');
const html = await response.text();
const markdown = await htmlToMarkdownAuto(html, response.url);
// Automatically detects it's a blog and applies blog-specific filtering
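
The snippet above assumes the request succeeds. A slightly more defensive variant is sketched below, with the fetch and conversion functions passed in so the helper can be exercised without network access. `scrapeToMarkdown`, `HttpResponse`, `Fetcher`, and `Converter` are hypothetical names introduced here, not part of the library.

```typescript
// Minimal response shape used by the helper; the standard fetch Response satisfies it.
interface HttpResponse {
  ok: boolean;
  status: number;
  url: string;
  text(): Promise<string>;
}

type Fetcher = (url: string) => Promise<HttpResponse>;
type Converter = (html: string, url?: string) => Promise<string>;

// Hypothetical helper: fetch a page, fail loudly on HTTP errors, then convert.
async function scrapeToMarkdown(
  url: string,
  fetcher: Fetcher,
  convert: Converter,
): Promise<string> {
  const response = await fetcher(url);
  if (!response.ok) {
    throw new Error(`Request for ${url} failed with status ${response.status}`);
  }
  const html = await response.text();
  // Pass the response URL (after any redirects) so detection sees the final location.
  return convert(html, response.url);
}
```

In real use, the global `fetch` and `htmlToMarkdownAuto` would be the natural arguments, as in `scrapeToMarkdown('https://blog.example.com/post-123', fetch, htmlToMarkdownAuto)`.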

News Article Processing

import { extractContentAuto } from 'html-content-processor';

const result = await extractContentAuto(newsHtml, 'https://news.site.com/article');
if (result.pageType.type === 'news') {
  console.log('High-quality news content extracted');
  console.log('Confidence:', result.pageType.confidence);
}

Documentation Conversion

import { htmlToMarkdownAuto } from 'html-content-processor';

// Convert technical documentation
const docMarkdown = await htmlToMarkdownAuto(docsHtml, 'https://docs.example.com/api');
// Automatically preserves code blocks, headers, and technical content structure

Performance

⚡ Fast Processing: Optimized algorithms for quick content extraction
💾 Memory Efficient: Minimal memory footprint
🔄 Batch Processing: Handle multiple documents efficiently
📊 Smart Caching: Automatic page type detection caching
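
This README does not document a dedicated batch API, so the following is only a caller-side sketch of processing several documents with bounded concurrency. `convertBatch`, `Page`, and `Converter` are hypothetical names, and the default limit of 4 is an arbitrary assumption; the conversion function is passed in as a parameter.

```typescript
type Converter = (html: string, url?: string) => Promise<string>;

interface Page {
  html: string;
  url?: string;
}

// Hypothetical helper: convert pages with at most `concurrency` conversions in flight.
async function convertBatch(
  pages: Page[],
  convert: Converter,
  concurrency = 4,
): Promise<string[]> {
  const results: string[] = new Array(pages.length);
  let next = 0; // index of the next unclaimed page, shared by all workers

  async function worker(): Promise<void> {
    while (next < pages.length) {
      const index = next++;
      const page = pages[index];
      // Store by index so output order matches input order.
      results[index] = await convert(page.html, page.url);
    }
  }

  const workerCount = Math.max(1, Math.min(concurrency, pages.length));
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return results;
}
```

Because `htmlToMarkdownAuto(html, url?, options?)` matches the `Converter` shape, a call such as `convertBatch(pages, htmlToMarkdownAuto, 4)` would be one way to use it. If any single conversion rejects, the whole batch rejects.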

License

MIT License
