A Node.js tool for converting PDF documents to Markdown using vision models. PDF2MD extracts text, tables, and images from PDFs and generates well-structured Markdown documents.
- Full Page Processing: Convert entire PDF pages to high-quality images for processing
- Visual Model Integration: Leverage state-of-the-art vision models for accurate text extraction
- Multiple Model Support: Compatible with OpenAI, Claude, Gemini, and Doubao vision models
- Structured Output: Generate clean, well-formatted Markdown documents
- Customizable: Configure image quality, processing options, and output format
# Clone the repository
git clone https://github.com/yourusername/pdf2md.git
cd pdf2md/pdf2md-node
# Install dependencies
npm install
# Build
npm run build
- Node.js 16.0.0 or higher
- API key for at least one of the supported vision models
import { parsePdf, getPageCount } from './src/index.js';
// Get PDF page count
const pageCount = await getPageCount('path/to/your.pdf');
console.log(`PDF has ${pageCount} pages`);
// Convert PDF to Markdown
const result = await parsePdf('path/to/your.pdf', {
  apiKey: 'your-api-key',
  model: 'gpt-4-vision-preview',
  useFullPage: true // Use full page processing mode
});
console.log(`Markdown file generated: ${result.mdFilePath}`);
const options = {
  // Output directory for generated files
  outputDir: './output',
  // API key for the vision model
  apiKey: 'your-api-key',
  // API endpoint (if using a custom endpoint)
  baseUrl: 'https://api.example.com/v1',
  // Vision model to use
  model: 'gpt-4-vision-preview',
  // Custom prompt for the vision model
  prompt: 'Convert this PDF to well-structured Markdown',
  // Whether to use full page processing (recommended)
  useFullPage: true,
  // Whether to keep intermediate image files
  verbose: false,
  // Image scaling factor (higher = better quality but slower)
  scale: 3,
  // Whether to use OpenAI-compatible API
  openAiApicompatible: true,
  // Concurrency (number of pages processed simultaneously)
  concurrency: 2,
  // Progress callback: lets the caller track processing progress.
  // The conversion is complete only when taskStatus is finished.
  onProgress: ({ current, total, taskStatus }) => {
    console.log(`Processed: ${current}, Total pages: ${total}, Task status: ${taskStatus}`);
  }
};
const result = await parsePdf('path/to/your.pdf', options);
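To illustrate how the `concurrency` and `onProgress` options might interact, here is a minimal sketch of a bounded worker pool. It is not taken from the PDF2MD source: `mapWithConcurrency` is a hypothetical helper, and the `'running'` status value is an assumption (the options above only mention a finished status).

```javascript
// Hypothetical helper: runs `worker` over `items` with at most `limit`
// calls in flight, reporting progress after each completed item.
async function mapWithConcurrency(items, limit, worker, onProgress = () => {}) {
  const results = new Array(items.length);
  let next = 0;     // index of the next item to claim
  let finished = 0; // number of items completed so far

  async function runner() {
    while (next < items.length) {
      const index = next++;
      results[index] = await worker(items[index], index);
      finished += 1;
      onProgress({
        current: finished,
        total: items.length,
        // 'running' is an assumed intermediate status value
        taskStatus: finished === items.length ? 'finished' : 'running',
      });
    }
  }

  // Start up to `limit` runners; each one pulls items until none remain.
  const runnerCount = Math.max(1, Math.min(limit, items.length));
  await Promise.all(Array.from({ length: runnerCount }, () => runner()));
  return results;
}
```

Results are stored by index, so the output order matches the page order even when pages finish out of order.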
Provider | Models |
---|---|
OpenAI | gpt-4-vision-preview, gpt-4o |
Claude | claude-3-opus-20240229, claude-3-sonnet-20240229 |
Gemini | gemini-pro-vision |
Doubao | doubao-1.5-vision-pro-32k-250115 |
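When a model is reached through an OpenAI-compatible endpoint, a page image is typically sent as a base64 data URL inside a chat completion request. The sketch below shows that request body shape; `buildVisionRequest` is a hypothetical helper, not part of the PDF2MD API, and how PDF2MD actually builds its requests is not shown in this README.

```javascript
// Hypothetical helper: builds an OpenAI-style chat completion body
// that carries a text prompt plus one PNG image.
function buildVisionRequest(model, prompt, pngBuffer) {
  const dataUrl = `data:image/png;base64,${pngBuffer.toString('base64')}`;
  return {
    model,
    messages: [
      {
        role: 'user',
        content: [
          { type: 'text', text: prompt },
          { type: 'image_url', image_url: { url: dataUrl } },
        ],
      },
    ],
  };
}

// Example with placeholder bytes standing in for a rendered page image.
const body = buildVisionRequest(
  'gpt-4o',
  'Convert this PDF to well-structured Markdown',
  Buffer.from('abc')
);
console.log(body.messages[0].content[1].image_url.url);
// prints "data:image/png;base64,YWJj"
```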
The project includes several test scripts to verify functionality:
# Test the full PDF to Markdown conversion process
node test/testFullProcess.js
# Test only the PDF to image conversion
node test/testFullPageImages.js
# Test specific vision models
node test/testModel.js
pdf2md-node/
├── src/
│ ├── index.js # Main entry point
│ ├── pdfParser.js # PDF parsing module
│ ├── imageGenerator.js # Image generation module
│ ├── modelClient.js # Vision model client
│ ├── markdownConverter.js # Markdown conversion module
│ └── utils.js # Utility functions
├── test/
│ ├── samples/ # Sample PDF files for testing
│ ├── testFullProcess.js # Full process test
│ └── ... (other test files)
└── package.json
PDF2MD consists of the following core modules, each responsible for specific functionality:
The main entry point (`index.js`) coordinates the entire system:
- Receives user input (PDF path and configuration options)
- Sequentially calls other modules to complete the conversion process
- Returns the final Markdown result
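The coordination steps above can be sketched as a small pipeline function. This is an illustration under stated assumptions, not the actual contents of `index.js`: `convertPdf` and the injected stage names (`renderPages`, `recognizePage`, `mergeMarkdown`) are hypothetical.

```javascript
// Hypothetical orchestration: each stage is injected, so the sequence
// (render -> recognize -> merge) stays visible and easy to test.
async function convertPdf(pdfPath, options, stages) {
  const { renderPages, recognizePage, mergeMarkdown } = stages;

  // 1. Turn the PDF into one image per page.
  const images = await renderPages(pdfPath, options);

  // 2. Ask the vision model for Markdown, page by page.
  const pageMarkdown = [];
  for (const image of images) {
    pageMarkdown.push(await recognizePage(image, options));
  }

  // 3. Combine the per-page Markdown into the final document.
  return mergeMarkdown(pageMarkdown);
}
```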
The PDF parsing module (`pdfParser.js`) parses PDF files and extracts structured information:
- Uses PDF.js library to load PDF files
- Extracts text content, images, and graphic elements from each page
- Generates a list of rectangular areas, each representing a content block in the PDF
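One plausible way to arrive at a list of rectangular content areas is to merge overlapping bounding boxes until none overlap. The sketch below shows that idea only; the README does not describe the algorithm the parser actually uses, and `mergeRects` and the `{ x, y, width, height }` shape are assumptions.

```javascript
// True when two axis-aligned rectangles share some area.
function overlaps(a, b) {
  return a.x < b.x + b.width && b.x < a.x + a.width &&
         a.y < b.y + b.height && b.y < a.y + a.height;
}

// Smallest rectangle that contains both inputs.
function union(a, b) {
  const x = Math.min(a.x, b.x);
  const y = Math.min(a.y, b.y);
  const right = Math.max(a.x + a.width, b.x + b.width);
  const bottom = Math.max(a.y + a.height, b.y + b.height);
  return { x, y, width: right - x, height: bottom - y };
}

// Hypothetical helper: repeatedly merges overlapping rectangles
// until every remaining rectangle is disjoint from the others.
function mergeRects(rects) {
  const merged = rects.map((r) => ({ ...r }));
  let changed = true;
  while (changed) {
    changed = false;
    outer:
    for (let i = 0; i < merged.length; i++) {
      for (let j = i + 1; j < merged.length; j++) {
        if (overlaps(merged[i], merged[j])) {
          merged[i] = union(merged[i], merged[j]);
          merged.splice(j, 1);
          changed = true;
          break outer; // restart the scan with the enlarged rectangle
        }
      }
    }
  }
  return merged;
}
```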
The image generation module (`imageGenerator.js`) renders PDF areas as images:
- Uses PDF.js rendering engine to render specified areas as high-definition images
- Supports adjustable scaling ratios to ensure image clarity
- Uses Sharp library to process and optimize images
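To make the effect of the scaling ratio concrete, the sketch below computes output image dimensions from a page size in PDF points. It assumes one pixel per point at scale 1 and rounds down; `scaledPixelSize` is a hypothetical helper, not a PDF2MD or PDF.js function.

```javascript
// Hypothetical helper: pixel dimensions of a rendered page,
// assuming 1 pixel per PDF point at scale 1.
function scaledPixelSize(widthPt, heightPt, scale) {
  return {
    width: Math.floor(widthPt * scale),
    height: Math.floor(heightPt * scale),
  };
}

// US Letter is 612 x 792 points.
console.log(scaledPixelSize(612, 792, 3));
// prints { width: 1836, height: 2376 }
```

Pixel count grows with the square of the scale, so going from scale 1 to scale 3 produces roughly nine times as many pixels per page, which is why higher values are slower.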
The vision model client (`modelClient.js`) interacts with various vision model APIs:
- Supports multiple vision models: OpenAI, Claude, Gemini, Doubao, etc.
- Provides a unified API calling interface, encapsulating features of different models
- Handles API call errors and retry mechanisms
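A common way to handle transient API errors is to retry with exponential backoff. The following is a minimal sketch of that pattern, not the client's actual retry logic; `withRetry`, `retries`, and `baseDelayMs` are hypothetical names and defaults.

```javascript
// Hypothetical helper: calls `fn` and retries on failure, doubling the
// wait between attempts (baseDelayMs, 2x, 4x, ...).
async function withRetry(fn, { retries = 3, baseDelayMs = 500 } = {}) {
  let lastError;
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      return await fn(attempt);
    } catch (err) {
      lastError = err;
      if (attempt === retries) break; // no attempts left
      const delay = baseDelayMs * 2 ** attempt;
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
  throw lastError;
}
```

A real client would usually retry only on errors that are likely to be transient, such as rate limits or timeouts, and fail immediately on authentication errors.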
The Markdown conversion module (`markdownConverter.js`) converts model results to standard Markdown format:
- Processes text content returned by the model
- Formats according to Markdown syntax standards
- Merges Markdown content from multiple areas
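The merging step can be sketched as follows: sort the per-page results, strip any code fence the model wrapped around its answer, and join the pieces with blank lines. This is an illustration under assumptions, not the module's actual code; `mergeMarkdown`, `stripFence`, and the `{ pageIndex, markdown }` shape are hypothetical.

```javascript
const FENCE = '`'.repeat(3);

// Removes a wrapping code fence if the model returned its answer inside one.
function stripFence(text) {
  const lines = text.trim().split('\n');
  const wrapped = lines.length >= 2 &&
    lines[0].startsWith(FENCE) &&
    lines[lines.length - 1].trim() === FENCE;
  return wrapped ? lines.slice(1, -1).join('\n').trim() : text.trim();
}

// Hypothetical helper: merges per-page Markdown in page order.
function mergeMarkdown(pages) {
  return [...pages]
    .sort((a, b) => a.pageIndex - b.pageIndex)
    .map((page) => stripFence(page.markdown))
    .filter((text) => text.length > 0)
    .join('\n\n') + '\n';
}
```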
This project is licensed under the MIT License - see the LICENSE file for details.