A robust MemberJunction package for synchronizing entities with vector databases by transforming entity records into vector representations using embedding models.
The @memberjunction/ai-vector-sync package provides a comprehensive solution for:
- Converting MemberJunction entities into vector embeddings
- Storing embeddings in vector databases (currently supports Pinecone)
- Managing the synchronization lifecycle between entities and their vector representations
- Supporting batch processing for large datasets
- Providing template-based document generation for vectorization
npm install @memberjunction/ai-vector-sync
Before using this package, ensure you have:
- SQL Database with MemberJunction Framework: A properly configured SQL database with the MemberJunction framework installed.
- API Keys
  - Embedding model API key (supports OpenAI, Mistral, etc.)
  - Vector database API key (currently supports Pinecone)
- Entity Configuration
  - Entity Document record defined in MemberJunction
  - Associated template for specifying which entity properties to vectorize
Transform entity records into high-dimensional vectors that capture the semantic meaning of the data.
Efficiently handle large datasets with configurable batch sizes for:
- Record fetching
- Vectorization
- Database upsertion
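As a rough illustration of what a configurable batch size means in practice, the sketch below splits a list of record IDs into fixed-size batches. The helper `toBatches` and the sample IDs are hypothetical and are not part of this package's API; the package's own batching logic may differ.

```typescript
// Hypothetical helper (not part of @memberjunction/ai-vector-sync):
// split an array into consecutive batches of at most `batchSize` items.
function toBatches<T>(records: T[], batchSize: number): T[][] {
  if (!Number.isInteger(batchSize) || batchSize <= 0) {
    throw new Error(`batchSize must be a positive integer, got ${batchSize}`);
  }
  const batches: T[][] = [];
  for (let i = 0; i < records.length; i += batchSize) {
    batches.push(records.slice(i, i + batchSize));
  }
  return batches;
}

// 120 made-up record IDs split with a batch size of 50
const recordIDs: string[] = Array.from({ length: 120 }, (_, i) => `record-${i}`);
const batches = toBatches(recordIDs, 50);
console.log(batches.map((b) => b.length)); // [ 50, 50, 20 ]
```

The last batch is simply shorter than the others, so no record is dropped when the total is not a multiple of the batch size.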
Use MemberJunction templates to define which entity fields and relationships to include in vectorization.
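To make the idea concrete, here is a minimal sketch of turning a record into the text that would be sent to an embedding model. The `{{Field}}` placeholder syntax, the `renderTemplate` helper, and the sample fields are assumptions for illustration only; the actual MemberJunction template engine and its syntax are not shown here.

```typescript
// Hypothetical record shape for illustration
type EntityRecord = Record<string, string | number | null>;

// Hypothetical renderer: replaces {{FieldName}} placeholders with record values.
// Missing or null fields render as an empty string.
function renderTemplate(template: string, record: EntityRecord): string {
  return template.replace(/\{\{\s*(\w+)\s*\}\}/g, (_match: string, field: string) => {
    const value = record[field];
    return value === null || value === undefined ? "" : String(value);
  });
}

const sampleTemplate = "Name: {{ Name }}. Industry: {{Industry}}. Notes: {{Notes}}";
const renderedText = renderTemplate(sampleTemplate, {
  Name: "Acme",
  Industry: "Manufacturing",
  Notes: null,
});
console.log(renderedText); // Name: Acme. Industry: Manufacturing. Notes: 
```

Only the fields named in the template end up in the text, which is how a template controls what the resulting vector can capture.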
Seamlessly integrate with vector databases through the MemberJunction AI infrastructure.
import { EntityVectorSyncer } from '@memberjunction/ai-vector-sync';
import { UserInfo } from '@memberjunction/core';
// Initialize the syncer
const syncer = new EntityVectorSyncer();
// Configure the syncer (required before first use)
await syncer.Config(false, contextUser);
// Vectorize an entity
const params = {
  entityID: 'your-entity-id',
  entityDocumentID: 'your-entity-document-id',
  listBatchCount: 50,        // Optional: records per batch (default: 50)
  VectorizeBatchCount: 50,   // Optional: vectorization batch size (default: 50)
  UpsertBatchCount: 50,      // Optional: upsert batch size (default: 50)
  StartingOffset: 0          // Optional: skip records for resuming
};
// Start vectorization (runs asynchronously)
syncer.VectorizeEntity(params, contextUser);
// Vectorize only records within a specific list
const params = {
  entityID: 'your-entity-id',
  entityDocumentID: 'your-entity-document-id',
  listID: 'your-list-id',    // Only vectorize records in this list
  listBatchCount: 100
};
await syncer.VectorizeEntity(params, contextUser);
// Get entity document by ID
const entityDoc = await syncer.GetEntityDocument('document-id');
// Get entity document by name
const entityDoc = await syncer.GetEntityDocumentByName('Document Name', contextUser);
// Get all active entity documents
const activeDocs = await syncer.GetActiveEntityDocuments();
// Get active documents for specific entities
const specificDocs = await syncer.GetActiveEntityDocuments(['Entity1', 'Entity2']);
import { VectorDatabaseEntity, AIModelEntity } from '@memberjunction/core-entities';
// Create a default entity document when one doesn't exist
const entityDoc = await syncer.CreateDefaultEntityDocument(
  entityID,
  vectorDatabase,  // VectorDatabaseEntity instance
  aiModel          // AIModelEntity instance
);
The main class for entity vectorization operations.
Configures the syncer and initializes required engines.
- forceRefresh: Force refresh of caches and engines
- contextUser: User context for operations
VectorizeEntity(params: VectorizeEntityParams, contextUser?: UserInfo): Promise<VectorizeEntityResponse>
Vectorizes entities based on provided parameters.
- params: Configuration for vectorization
- contextUser: Required user context
Retrieves an entity document by ID.
GetEntityDocumentByName(entityDocumentName: string, contextUser?: UserInfo): Promise<EntityDocumentEntity | null>
Retrieves an entity document by name.
Gets all active entity documents, optionally filtered by entity names.
CreateDefaultEntityDocument(entityID: string, vectorDatabase: VectorDatabaseEntity, aiModel: AIModelEntity): Promise<EntityDocumentEntity>
Creates a default entity document for the specified entity.
type VectorizeEntityParams = {
  entityID: string;              // Required: Entity to vectorize
  entityDocumentID?: string;     // Entity document configuration
  listID?: string;               // Optional: Specific list to vectorize
  listBatchCount?: number;       // Records per fetch batch (default: 50)
  VectorizeBatchCount?: number;  // Vectorization batch size (default: 50)
  UpsertBatchCount?: number;     // Database upsert batch size (default: 50)
  StartingOffset?: number;       // Skip records for resuming
  CurrentUser?: UserInfo;        // User context
  options?: any;                 // Additional options
}
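The optional batch fields fall back to a documented default of 50. The sketch below shows one way a caller might resolve those defaults up front. `BatchOptions` and `resolveBatchOptions` are hypothetical names, and the default of 0 for `StartingOffset` is an assumption, since the documentation only states defaults for the three batch counts.

```typescript
// Local, hypothetical copy of the batch-related fields of VectorizeEntityParams
type BatchOptions = {
  listBatchCount?: number;
  VectorizeBatchCount?: number;
  UpsertBatchCount?: number;
  StartingOffset?: number;
};

const DEFAULT_BATCH_COUNT = 50; // default documented for all three batch counts

// Fill in any omitted field so downstream code never sees undefined.
function resolveBatchOptions(options: BatchOptions): Required<BatchOptions> {
  return {
    listBatchCount: options.listBatchCount ?? DEFAULT_BATCH_COUNT,
    VectorizeBatchCount: options.VectorizeBatchCount ?? DEFAULT_BATCH_COUNT,
    UpsertBatchCount: options.UpsertBatchCount ?? DEFAULT_BATCH_COUNT,
    StartingOffset: options.StartingOffset ?? 0, // assumed default: start at the first record
  };
}

const resolvedOptions = resolveBatchOptions({ listBatchCount: 100 });
console.log(resolvedOptions);
```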
type EntitySyncConfig = {
  EntityDocumentID: string;      // Entity document to use
  Interval: number;              // Sync interval in seconds
  RunViewParams: RunViewParams;  // View parameters for fetching records
  IncludeInSync: boolean;        // Include in sync process
  LastRunDate: string;           // Last sync timestamp
  VectorIndexID: number;         // Vector index ID
  VectorID: number;              // Vector database ID
}
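As an illustration of how `Interval`, `LastRunDate`, and `IncludeInSync` could work together, the sketch below decides whether a sync is due. It assumes `LastRunDate` is an ISO 8601 string that `Date.parse` understands and that an unparseable date means "never run"; neither is confirmed by the package, and `SyncSchedule` and `isSyncDue` are hypothetical names.

```typescript
// Hypothetical subset of EntitySyncConfig, limited to the scheduling fields
type SyncSchedule = {
  Interval: number;       // seconds between syncs
  LastRunDate: string;    // assumed to be ISO 8601
  IncludeInSync: boolean;
};

// Returns true when the entity is included and at least `Interval` seconds
// have elapsed since the last run.
function isSyncDue(config: SyncSchedule, now: Date): boolean {
  if (!config.IncludeInSync) {
    return false;
  }
  const lastRun = Date.parse(config.LastRunDate);
  if (Number.isNaN(lastRun)) {
    return true; // assumption: treat a missing or unparseable date as "never run"
  }
  return now.getTime() - lastRun >= config.Interval * 1000;
}

const hourlySchedule: SyncSchedule = {
  Interval: 3600,
  LastRunDate: "2024-01-01T00:00:00Z",
  IncludeInSync: true,
};
console.log(isSyncDue(hourlySchedule, new Date("2024-01-01T00:30:00Z"))); // false
console.log(isSyncDue(hourlySchedule, new Date("2024-01-01T01:00:00Z"))); // true
```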
- Entity Document Retrieval: Fetches configuration from Entity Document record
- Model and Database Configuration: Sets up embedding model and vector database
- Data Fetching: Retrieves entity records in batches
- Vectorization: Transforms records using embedding model
- Vector Upsertion: Stores vectors in database
- EntityRecordDocument Creation: Creates tracking records
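The fetch, vectorize, and upsert steps above can be sketched as a simple loop. Everything here is a stand-in: `fakeEmbed` is not a real embedding model, the `Map` replaces the vector database, and the real calls are network-bound and asynchronous, whereas this sketch is synchronous to stay self-contained.

```typescript
type SourceRecord = { id: string; text: string };

// Stand-in for an embedding model. NOT a real embedding: it only derives
// two numbers from the text so the flow can be followed end to end.
function fakeEmbed(text: string): number[] {
  let sum = 0;
  for (let i = 0; i < text.length; i++) {
    sum += text.charCodeAt(i);
  }
  return [text.length, sum % 97];
}

// Hypothetical pipeline: fetch a batch, vectorize it, upsert it, repeat.
// Returns the number of vectors written to the store.
function runPipeline(
  records: SourceRecord[],
  batchSize: number,
  vectorStore: Map<string, number[]>
): number {
  if (!Number.isInteger(batchSize) || batchSize <= 0) {
    throw new Error("batchSize must be a positive integer");
  }
  let upserted = 0;
  for (let offset = 0; offset < records.length; offset += batchSize) {
    const batch = records.slice(offset, offset + batchSize); // data fetching
    const vectors = batch.map((r) => ({ id: r.id, values: fakeEmbed(r.text) })); // vectorization
    for (const v of vectors) {
      vectorStore.set(v.id, v.values); // upsert: insert or overwrite by ID
    }
    upserted += vectors.length;
  }
  return upserted;
}

const sampleRecords: SourceRecord[] = ["ab", "cd", "ef", "gh", "ij"].map((text, i) => ({
  id: `rec-${i}`,
  text,
}));
const vectorStore = new Map<string, number[]>();
const upsertedCount = runPipeline(sampleRecords, 2, vectorStore);
console.log(upsertedCount, vectorStore.size); // 5 5
```

Because the store is keyed by record ID, running the pipeline twice overwrites existing vectors instead of duplicating them, which is the behavior "upsert" implies.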
The package uses a multi-worker architecture for efficient processing:
- VectorizeTemplates Worker: Handles template-based text generation and embedding
- UpsertVectors Worker: Manages vector database operations
- EntityRecordDocument Worker: Tracks vector-entity relationships
Create a .env file with:
# Database Configuration
DB_HOST=your-database-host
DB_PORT=1433
DB_USERNAME=your-username
DB_PASSWORD=your-password
DB_DATABASE=your-database
# API Keys
OPENAI_API_KEY=your-openai-key
MISTRAL_API_KEY=your-mistral-key
PINECONE_API_KEY=your-pinecone-key
PINECONE_HOST=your-pinecone-host
PINECONE_DEFAULT_INDEX=your-default-index
# User Configuration
CURRENT_USER_EMAIL=user@example.com
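A missing variable is easier to diagnose at startup than halfway through a long run. The sketch below checks an environment object for the variables listed above. Which variables are strictly required is an assumption (the embedding keys are left out because only the key for the model in use is needed), and `findMissingKeys` is a hypothetical helper, not part of the package.

```typescript
// Assumed-required variables, taken from the .env listing above
const REQUIRED_KEYS: string[] = [
  "DB_HOST",
  "DB_PORT",
  "DB_USERNAME",
  "DB_PASSWORD",
  "DB_DATABASE",
  "PINECONE_API_KEY",
  "PINECONE_HOST",
  "PINECONE_DEFAULT_INDEX",
  "CURRENT_USER_EMAIL",
];

// Returns the names of variables that are absent or blank.
function findMissingKeys(
  env: Record<string, string | undefined>,
  required: string[] = REQUIRED_KEYS
): string[] {
  return required.filter((key) => {
    const value = env[key];
    return value === undefined || value.trim() === "";
  });
}

// Example with a deliberately incomplete environment
const exampleEnv: Record<string, string | undefined> = {
  DB_HOST: "localhost",
  DB_PORT: "1433",
  DB_USERNAME: "user",
  DB_PASSWORD: "   ",
  DB_DATABASE: "mj",
  PINECONE_API_KEY: "key",
  PINECONE_DEFAULT_INDEX: "index",
  CURRENT_USER_EMAIL: "user@example.com",
};
const missingKeys = findMissingKeys(exampleEnv);
console.log(missingKeys); // [ 'DB_PASSWORD', 'PINECONE_HOST' ]
```

In a Node.js process the same function can be called with `process.env` once the .env file has been loaded.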
- Long-Running Processes: Vectorization can take hours for large datasets
- Batch Sizes: Adjust batch sizes based on your system resources
- Asynchronous Processing: Consider running vectorization in background processes
- Memory Usage: Monitor memory usage for large batch sizes
This package integrates seamlessly with:
- @memberjunction/core: Core entity and metadata functionality
- @memberjunction/ai: AI model abstractions
- @memberjunction/ai-vectordb: Vector database abstractions
- @memberjunction/templates: Template processing engine
The package includes comprehensive error handling:
- Validation of entity documents and templates
- Graceful handling of API failures
- Detailed logging through MemberJunction's logging system
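The documentation does not say how the package handles API failures internally. As one common caller-side pattern, the sketch below retries a failing asynchronous operation with exponential backoff. `withRetry` and the flaky operation are hypothetical and only illustrate the pattern.

```typescript
// Hypothetical helper: retry an async operation up to `maxAttempts` times,
// doubling the wait after each failure (baseDelayMs, 2x, 4x, ...).
async function withRetry<T>(
  operation: () => Promise<T>,
  maxAttempts: number,
  baseDelayMs: number
): Promise<T> {
  if (!Number.isInteger(maxAttempts) || maxAttempts < 1) {
    throw new Error("maxAttempts must be a positive integer");
  }
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await operation();
    } catch (error) {
      lastError = error;
      if (attempt < maxAttempts) {
        const delayMs = baseDelayMs * Math.pow(2, attempt - 1);
        await new Promise<void>((resolve) => setTimeout(resolve, delayMs));
      }
    }
  }
  throw lastError; // every attempt failed: surface the last error
}

// Simulated API call that fails twice before succeeding
let callCount = 0;
async function flakyUpsert(): Promise<string> {
  callCount++;
  if (callCount < 3) {
    throw new Error(`simulated failure #${callCount}`);
  }
  return "ok";
}

withRetry(flakyUpsert, 5, 10).then((result) => {
  console.log(result, "after", callCount, "calls");
});
```

Backoff matters most for rate-limited services, where retrying immediately tends to fail again for the same reason.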
- Start with Small Batches: Test with small batch sizes before processing large datasets
- Monitor Progress: Use MemberJunction's logging to track vectorization progress
- Handle Interruptions: Use StartingOffset to resume interrupted processes
- Template Design: Design templates to include relevant fields for semantic search
- Resource Management: Consider database and API rate limits when setting batch sizes
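For the interruption case, the value to pass as `StartingOffset` on the next run can be derived from how many batches finished. The sketch assumes that `StartingOffset` counts records (as "skip records" suggests) and that progress is tracked in whole fetch batches. `nextStartingOffset` is a hypothetical helper.

```typescript
// Hypothetical helper: compute the StartingOffset for a resumed run from the
// previous offset and the number of batches that completed successfully.
function nextStartingOffset(
  previousOffset: number,
  completedBatches: number,
  batchSize: number
): number {
  if (previousOffset < 0 || completedBatches < 0 || batchSize <= 0) {
    throw new Error("offsets and counts must be non-negative, batchSize positive");
  }
  return previousOffset + completedBatches * batchSize;
}

// A run that started at record 0 and finished 7 batches of 50 before stopping
const resumeOffset = nextStartingOffset(0, 7, 50);
console.log(resumeOffset); // 350
```

Logging the batch count as each batch completes gives the information needed to compute this value after a crash.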
ISC - See LICENSE file for details
MemberJunction.com