[!WARNING] Early Development Stage: This project is under active development and may undergo significant API changes between versions.
Unified interface for web archive providers
- Simple API for listing archived URLs for a domain
- Support for multiple archive providers:
  - Internet Archive's Wayback Machine (web.archive.org)
  - Archive.today (archive.ph)
  - Perma.cc (perma.cc)
  - Common Crawl (commoncrawl.org)
- Consistent, standardized response format with platform-specific metadata
- Tree-shakable design: import only the providers you need
- Configurable request options
- TypeScript support
- Integrated caching system with unstorage
# npm
npm install omnichron
# yarn
yarn add omnichron
# pnpm
pnpm add omnichron
import { createArchive, providers } from 'omnichron'
// Create an archive client for Wayback Machine
const waybackArchive = createArchive(providers.wayback())
// Get archived snapshots for a domain (with optional limit)
const response = await waybackArchive.getSnapshots('example.com', { limit: 100 })
if (response.success) {
console.log('Archived snapshots:', response.pages)
// [
// {
// url: 'https://example.com',
// timestamp: '2022-01-01T00:00:00Z',
// snapshot: 'https://web.archive.org/web/20220101000000/https://example.com',
// _meta: {
// timestamp: '20220101000000',
// status: 200
// }
// },
// ...
// ]
} else {
console.error('Error:', response.error)
}
// Using Archive.today
const archiveTodayArchive = createArchive(providers.archiveToday())
const archiveTodayResponse = await archiveTodayArchive.getSnapshots('example.com')
// Nuxt.js API endpoint (server/api/snapshots.ts)
import { createArchive, providers } from 'omnichron'
const archive = createArchive(
  providers.all({
    // Timeouts are given in milliseconds: 10 minutes
    timeout: 10 * 60 * 1000
  })
)
export default defineEventHandler(async () => {
const snapshots = await archive.getSnapshots('example.com')
return snapshots
})
For better performance and a smaller bundle size, providers are lazy-loaded:
// Only the Wayback Machine provider gets loaded
import { createArchive, providers } from 'omnichron'
// The provider is loaded on-demand
const archive = createArchive(providers.wayback())
The library uses TypeScript for type safety, including type assertions for API responses:
// Example of typed response handling (ofetch and the endpoint path are illustrative)
import { ofetch } from 'ofetch'
interface PermaccResponse {
  objects: Array<{
    guid: string
    url: string
    creation_timestamp: string
  }>
  meta: {
    total_count: number
  }
}
// Using a type assertion for proper typing
const response = await ofetch('/api/endpoint') as PermaccResponse
// Now you have full autocompletion and type safety
console.log(response.objects[0].guid)
console.log(response.meta.total_count)
Perma.cc requires an API key for authentication:
import { createArchive, providers } from 'omnichron'
// Create with required API key
const archive = createArchive(providers.permacc({
apiKey: 'YOUR_API_KEY'
}))
const response = await archive.getSnapshots('example.com')
omnichron provides an integrated caching system that helps reduce API calls and improve performance:
import { createArchive, providers, configureStorage } from 'omnichron'
import fsDriver from 'unstorage/drivers/fs'
// Configure the cache with custom settings
configureStorage({
// Use filesystem driver for persistent cache
driver: fsDriver({ base: './cache' }),
// Set cache TTL (time-to-live) in milliseconds (default: 7 days)
ttl: 24 * 60 * 60 * 1000, // 1 day
// Enable/disable cache globally (default: true)
cache: true,
// Set a custom cache key prefix (default: 'omnichron')
prefix: 'my-app-cache'
})
const archive = createArchive(providers.wayback())
// Use cache (default behavior)
const response1 = await archive.getSnapshots('example.com')
// First call hits API, subsequent calls use cache
const response2 = await archive.getSnapshots('example.com')
console.log('From cache:', response2.fromCache) // true
// Bypass cache for specific requests
const freshResponse = await archive.getSnapshots('example.com', { cache: false })
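To make the TTL and prefix settings above concrete, here is a minimal in-memory sketch of a time-to-live cache. It is not omnichron's implementation (which builds on unstorage): the `TtlCache` class, its key layout (`prefix:provider:domain`), and the injectable clock are all assumptions made for illustration.

```typescript
// Hypothetical in-memory TTL cache; illustrative only, not omnichron's internals.
interface CacheEntry<T> {
  value: T
  expiresAt: number // epoch milliseconds after which the entry is stale
}

class TtlCache<T> {
  private entries = new Map<string, CacheEntry<T>>()
  private ttl: number
  private prefix: string
  private now: () => number

  // `now` is injectable so expiry can be exercised without waiting
  constructor(ttl: number, prefix = 'omnichron', now: () => number = () => Date.now()) {
    this.ttl = ttl
    this.prefix = prefix
    this.now = now
  }

  // Assumed key layout: prefix, provider slug, then domain
  private key(provider: string, domain: string): string {
    return `${this.prefix}:${provider}:${domain}`
  }

  get(provider: string, domain: string): T | undefined {
    const k = this.key(provider, domain)
    const entry = this.entries.get(k)
    if (!entry) return undefined
    if (this.now() >= entry.expiresAt) {
      this.entries.delete(k) // drop stale entries lazily on read
      return undefined
    }
    return entry.value
  }

  set(provider: string, domain: string, value: T): void {
    this.entries.set(this.key(provider, domain), {
      value,
      expiresAt: this.now() + this.ttl
    })
  }
}

// Usage with a fake clock: the entry is served until the TTL elapses
let clock = 0
const cache = new TtlCache<string[]>(1000, 'my-app-cache', () => clock)
cache.set('wayback', 'example.com', ['snapshot-1'])
const hit = cache.get('wayback', 'example.com')
clock = 1000
const miss = cache.get('wayback', 'example.com')
```

Passing `{ cache: false }` for a request would correspond to skipping the `get` call in a sketch like this one.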
Common Crawl provides access to massive web archives through different crawl collections:
import { createArchive, providers } from 'omnichron'
// Create with a specific collection or use latest (default)
const archive = createArchive(providers.commoncrawl({
collection: 'CC-MAIN-2023-50',
limit: 50 // Maximum number of results
}))
const response = await archive.getSnapshots('example.com')
All providers return data in a consistent format with standardized fields plus provider-specific metadata:
interface ArchiveResponse {
  success: boolean; // Whether the request succeeded
  pages: ArchivedPage[]; // Array of archived pages
  error?: string; // Error message if success is false
  _meta?: Record<string, any>; // Response-level provider-specific metadata
  fromCache?: boolean; // Indicates if response came from cache
}
interface ArchivedPage {
  url: string; // The original URL (consistent across all providers)
  timestamp: string; // ISO 8601 date format (consistent across all providers)
  snapshot: string; // Direct URL to the archived version of the page
  _meta: { // Provider-specific metadata
    // For Wayback Machine:
    timestamp?: string; // Original timestamp format
    // HTTP status code (Wayback Machine) or status of the archived page (Perma.cc)
    status?: number | string;
    // For Archive.today:
    hash?: string; // Hash from the archive URL
    raw_date?: string; // Original date string from archive.today
    // For Perma.cc:
    guid?: string; // Perma.cc's unique identifier
    title?: string; // Title of the archived page
    created_by?: string; // ID of the user who created the archive
    // For Common Crawl:
    digest?: string; // Content digest (hash)
    mime?: string; // MIME type of the content
    length?: string; // Content length
    collection?: string; // Common Crawl collection identifier
    // The exact fields vary; each provider includes the metadata relevant to its archive format
  };
}
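As an illustration of how the normalized `timestamp` relates to the raw value kept in `_meta.timestamp`, the sketch below converts a 14-digit Wayback Machine timestamp into ISO 8601 and rebuilds the snapshot URL in the shape shown in the usage example. Both helper names are hypothetical and are not part of omnichron's API.

```typescript
// Converts a 14-digit Wayback Machine timestamp (YYYYMMDDhhmmss, UTC) to ISO 8601.
// Hypothetical helper for illustration only.
function waybackTimestampToIso(raw: string): string {
  const match = /^(\d{4})(\d{2})(\d{2})(\d{2})(\d{2})(\d{2})$/.exec(raw)
  if (!match) {
    throw new Error(`Invalid Wayback timestamp: ${raw}`)
  }
  const [, year, month, day, hour, minute, second] = match
  return `${year}-${month}-${day}T${hour}:${minute}:${second}Z`
}

// Builds a snapshot URL from the raw timestamp and the original URL.
function waybackSnapshotUrl(raw: string, originalUrl: string): string {
  return `https://web.archive.org/web/${raw}/${originalUrl}`
}

const iso = waybackTimestampToIso('20220101000000')
const snapshotUrl = waybackSnapshotUrl('20220101000000', 'https://example.com')
```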
omnichron includes several performance optimizations for handling large volumes of requests:
import { createArchive, providers } from 'omnichron'
// Create archive with performance options
const archive = createArchive(providers.wayback(), {
// Control parallel requests (default: 5)
concurrency: 10,
// Control batch processing size (default: 50)
batchSize: 100,
// Set request timeout in milliseconds (default: 30000)
timeout: 60000,
// Configure retry attempts for failed requests (default: 2)
retries: 3
})
// These options can also be set per request
const response = await archive.getSnapshots('example.com', {
concurrency: 5,
timeout: 45000
})
Key performance features:
- Concurrency control: Limits the number of simultaneous requests to prevent overwhelming the remote server
- Batch processing: Processes large datasets in manageable chunks to optimize memory usage
- Configurable timeouts: Allows setting custom timeouts for all or specific requests
- Automatic retries: Retries failed requests, with a configurable delay and set of retryable status codes
- Error handling: Provides detailed error information with context for easier debugging
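The batching and concurrency ideas above can be sketched in a few lines. This is a generic illustration of the two techniques, not omnichron's internal code; `chunk` and `mapWithConcurrency` are hypothetical names.

```typescript
// Splits items into batches of at most `size` elements.
function chunk<T>(items: T[], size: number): T[][] {
  if (size < 1) throw new Error('size must be at least 1')
  const batches: T[][] = []
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size))
  }
  return batches
}

// Runs `task` over all items with at most `limit` tasks in flight,
// preserving input order in the results.
async function mapWithConcurrency<T, R>(
  items: T[],
  limit: number,
  task: (item: T) => Promise<R>
): Promise<R[]> {
  const results: R[] = new Array(items.length)
  let next = 0
  async function worker(): Promise<void> {
    while (next < items.length) {
      const index = next++ // claim the next index before awaiting
      results[index] = await task(items[index])
    }
  }
  const workerCount = Math.max(1, Math.min(limit, items.length))
  await Promise.all(Array.from({ length: workerCount }, () => worker()))
  return results
}

// Usage: six simulated requests, never more than two running at once
let inFlight = 0
let peak = 0
const demo = mapWithConcurrency([1, 2, 3, 4, 5, 6], 2, async (n) => {
  inFlight++
  peak = Math.max(peak, inFlight)
  await new Promise((resolve) => setTimeout(resolve, 5)) // stand-in for a network call
  inFlight--
  return n * 10
})
```

A fixed pool of workers pulling from a shared index keeps the number of in-flight requests bounded without needing a queue library.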
You can query multiple archive providers at once:
import { createArchive, providers } from 'omnichron'
// Option 1: Use the all() helper
const allArchive = createArchive(providers.all())
// Option 2: Create archive with specific providers
const multiArchive = createArchive([
  providers.wayback(),
  providers.archiveToday(),
  providers.permacc({ apiKey: 'YOUR_API_KEY' })
])
// This will query all providers in parallel and combine results
const response = await multiArchive.getSnapshots('example.com', {
  limit: 100,
  concurrency: 3 // Maximum number of providers to query simultaneously
})
// Results are automatically merged and sorted by date (newest first)
console.log(response.pages)
// Response includes metadata about the multi-provider query
// (_meta is optional, so use optional chaining)
console.log(response._meta?.providerCount) // 3
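The merge step can be pictured roughly as follows. This is a sketch under stated assumptions: `mergePages` is a hypothetical helper, the `Page` type mirrors only the standardized fields, and whether omnichron deduplicates or applies the limit before or after merging is not specified here.

```typescript
interface Page {
  url: string
  timestamp: string // ISO 8601
  snapshot: string
}

// Merges per-provider page lists, sorts newest first, and applies an optional limit.
function mergePages(perProvider: Page[][], limit?: number): Page[] {
  const merged = ([] as Page[]).concat(...perProvider)
  merged.sort((a, b) => Date.parse(b.timestamp) - Date.parse(a.timestamp))
  return limit === undefined ? merged : merged.slice(0, limit)
}

// Usage with made-up sample data from two providers
const fromWayback: Page[] = [
  { url: 'https://example.com', timestamp: '2022-01-01T00:00:00Z', snapshot: 'https://web.archive.org/web/20220101000000/https://example.com' },
  { url: 'https://example.com', timestamp: '2020-06-15T12:00:00Z', snapshot: 'https://web.archive.org/web/20200615120000/https://example.com' }
]
const fromArchiveToday: Page[] = [
  { url: 'https://example.com', timestamp: '2021-03-10T08:30:00Z', snapshot: 'https://archive.ph/abc12' }
]
const newestTwo = mergePages([fromWayback, fromArchiveToday], 2)
```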
Creates an archive client for one or more providers.
- providers: A single archive provider instance or an array of providers
- options: Global options for all requests (optional)
Returns an object with:
- getSnapshots(domain, options?): Function to get archived snapshots for a domain, returning a full response object
- getPages(domain, options?): Function to get archived snapshots for a domain, returning only the pages array or throwing on error
- use(provider): Function to add a new provider to this archive instance
- useAll(providers): Function to add multiple providers to this archive instance at once
The individual provider factory functions are accessible through the providers object for lazy-loading:
- providers.wayback(options?): Wayback Machine (web.archive.org)
- providers.archiveToday(options?): Archive.today (archive.ph)
- providers.permacc(options?): Perma.cc (perma.cc)
- providers.commoncrawl(options?): Common Crawl (commoncrawl.org)
- providers.webcite(options?): WebCite
- providers.all(options?): Helper that initializes all common providers at once
Gets archived snapshots for a domain from the archive provider.
- domain: The domain to get archived snapshots for
- options: Request-specific options (optional)
  - limit: Maximum number of results to return
  - cache: Enable/disable caching for this request
  - ttl: Cache TTL in milliseconds for this request
  - concurrency: Maximum number of concurrent requests
  - batchSize: Number of items to process in a single batch
  - timeout: Request timeout in milliseconds
  - retries: Number of retry attempts for failed requests
Fetches archived snapshots for a domain, returning only the pages array or throwing an error if the request fails.
- domain: The domain to get archived snapshots for
- options: Request-specific options (optional)
  - limit: Maximum number of results to return
  - cache: Enable/disable caching for this request
  - ttl: Cache TTL in milliseconds for this request
  - concurrency: Maximum number of concurrent requests
  - batchSize: Number of items to process in a single batch
  - timeout: Request timeout in milliseconds
  - retries: Number of retry attempts for failed requests
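The difference between the full-response call and the pages-only call comes down to how failures surface. The sketch below shows that contract with a hypothetical `pagesOrThrow` helper and simplified stand-in types; the actual error type and message omnichron throws are assumptions.

```typescript
// Simplified stand-ins for the response types documented above
interface ArchivedPageLike {
  url: string
  timestamp: string
  snapshot: string
}
interface ArchiveResponseLike {
  success: boolean
  pages: ArchivedPageLike[]
  error?: string
}

// Hypothetical helper: returns the pages on success, throws on failure.
function pagesOrThrow(response: ArchiveResponseLike): ArchivedPageLike[] {
  if (!response.success) {
    throw new Error(response.error ?? 'Archive request failed')
  }
  return response.pages
}

// Usage: a successful response yields its pages, a failed one throws
const okPages = pagesOrThrow({
  success: true,
  pages: [{ url: 'https://example.com', timestamp: '2022-01-01T00:00:00Z', snapshot: 'https://web.archive.org/web/20220101000000/https://example.com' }]
})

let failureMessage = ''
try {
  pagesOrThrow({ success: false, pages: [], error: 'Request timed out' })
} catch (err) {
  failureMessage = err instanceof Error ? err.message : String(err)
}
```

With the throwing variant, callers use `try`/`catch` instead of checking `response.success` on every call.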
Configures the storage system.
- options: Configuration options (optional)
  - driver: Custom storage driver from unstorage
  - ttl: Default TTL in milliseconds
  - cache: Enable/disable cache globally
Access to the underlying storage instance.
Clears cached responses for a specific provider.
- provider: The provider object or slug name to clear cache for
- ✅ Internet Archive's Wayback Machine
- ✅ Archive.today
- ✅ Perma.cc
- ✅ Common Crawl
- ✅ WebCite
- 🔜 Archive-It
- 🔜 Conifer (formerly Webrecorder)
- ✅ Proxy-based lazy-loading providers with automatic tree-shaking
- ✅ Framework-agnostic design (works with Node.js, Nuxt, Edge functions, etc.)
- ✅ Local and persistent caching layer using unstorage
- ✅ Performance optimizations for high-volume requests
  - Parallel processing with concurrency control
  - Batch processing for large datasets
  - Configurable timeouts and retries
- ✅ Structured logging with consola
- 🔜 Page Archiving API - create archives in addition to retrieving them
MIT