@vladimiry/ndx

0.5.1 • Public • Published

ndx

ndx is a lightweight JavaScript (TypeScript) library for full-text indexing and searching.

Live Demo

Reddit Comments Search Engine is a simple demo application that indexes 10,000 Reddit comments. The demo requires modern browser features: Web Workers and IndexedDB. Comments are stored in IndexedDB, and the search engine runs in a Web Worker.

Features

  • Multiple fields full-text indexing and searching.
  • Per-field score boosting.
  • BM25 ranking function to rank matching documents, the same function that Lucene has used by default since 6.0.0.
  • Trie-based dynamic Inverted Index.
  • Configurable tokenizer and term filter.
  • Free text queries with query expansion.
  • Small memory footprint, optimized for mobile devices.
  • Serializable/deserializable index.
  • ~1.7kb minified and gzipped.

The ndx library doesn't provide any advanced text processing functions: the default tokenizer breaks text on whitespace characters, and the default filter lower-cases terms and removes non-word characters at their beginning and end. Natural is a good library that has many useful functions for text processing.
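As a rough illustration of that default behavior, the sketch below shows a whitespace tokenizer and a filter that lower-cases a token and trims non-word characters from its ends. The names `whitespaceTokenizer`, `trimNonWordFilter` and `toTerms` are hypothetical, and the regular expressions are an approximation, not the library's actual source; only the function shapes `(string) => string[]` and `(string) => string` come from the documented options.

```typescript
// Hypothetical helpers approximating the documented defaults.

// Split on runs of whitespace (spaces, tabs, line feeds) and drop empty tokens.
const whitespaceTokenizer = (text: string): string[] =>
  text.split(/\s+/).filter((token) => token.length > 0);

// Lower-case a token and strip non-word characters from both ends.
// Characters inside the token (e.g. the apostrophe in "don't") are kept.
const trimNonWordFilter = (term: string): string =>
  term.toLowerCase().replace(/^\W+/, "").replace(/\W+$/, "");

// Apply both steps and drop terms that become empty after filtering.
const toTerms = (text: string): string[] =>
  whitespaceTokenizer(text)
    .map(trimNonWordFilter)
    .filter((term) => term.length > 0);

// toTerms("Hello, World!  (ndx)") yields ["hello", "world", "ndx"]
```

Functions of this shape could be passed as the `tokenizer` and `filter` options of `DocumentIndex`. Note that `\W` is ASCII-based in JavaScript, so this sketch would also strip accented letters at the edges of a token.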

NPM Package

The npm package ndx provides CommonJS, ES5 and ES6 modules with TypeScript typings.

Example

import { DocumentIndex } from "ndx";

const index = new DocumentIndex();
index.addField("title");
index.addField("content");

const documents = [
  {
    id: "doc1",
    title: "First Document",
    content: "Lorem ipsum dolor",
  },
  {
    id: "doc2",
    title: "Second Document",
    content: "Lorem ipsum",
  }
];

documents.forEach((doc) => {
  index.add(doc.id, doc);
});

index.search("First");
// => [{ docId: "doc1", score: ... }]

index.search("Lorem");
// => [{ docId: "doc2", score: ... }, { docId: "doc1", score: ... }]

Documentation

Creating a new Document Index

DocumentIndex<I, D>(options?: DocumentIndexOptions)

Document Index is the main object. It stores all internal statistics and the Inverted Index for documents.

Parametric Types

  • I is a type of document IDs.
  • D is a type of documents.

Options

/**
 * BM25 Ranking function constants.
 */
interface BM25Options {
  /**
   * Controls non-linear term frequency normalization (saturation).
   *
   * Default value: 1.2
   */
  k1?: number;

  /**
   * Controls to what degree document length normalizes tf values.
   *
   * Default value: 0.75
   */
  b?: number;
}

interface DocumentIndexOptions {
  /**
   * Tokenizer is a function that breaks a text into words, phrases, symbols, or other meaningful elements called
   * tokens.
   *
   * Default tokenizer breaks words on spaces, tabs, line feeds and assumes that contiguous nonwhitespace characters
   * form a single token.
   */
  tokenizer?: (query: string) => string[];
  /**
   * Filter is a function that processes tokens and returns terms, terms are used in Inverted Index to index documents.
   *
   * Default filter transforms all characters to lower case and removes all non-word characters at the beginning and
   * the end of a term.
   */
  filter?: (term: string) => string;
  /**
   * BM25 Ranking function constants.
   */
  bm25?: BM25Options;
}

Example

/**
 * Creating a simple index with default options.
 */
const defaultIndex = new DocumentIndex();

/**
 * Creating an index with changed BM25 constants.
 */
const index = new DocumentIndex({
  bm25: {
    k1: 1.3,
    b: 0.8,
  },
});
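To give a feel for what `k1` and `b` control, here is a minimal sketch of the textbook BM25 score for a single term, using the Lucene-style IDF. It is an assumption-laden illustration, not ndx's internal code: the names `TermStats`, `bm25Idf` and `bm25TermScore` are hypothetical, and the library may combine fields, boosts and expanded terms differently.

```typescript
interface Bm25Params {
  k1: number; // term frequency saturation
  b: number;  // strength of length normalization (0 disables it)
}

interface TermStats {
  tf: number;             // occurrences of the term in the field
  fieldLength: number;    // number of terms in this document's field
  avgFieldLength: number; // average field length across the index
  docCount: number;       // documents in the index
  docsWithTerm: number;   // documents that contain the term
}

// Lucene-style inverse document frequency: rarer terms weigh more.
function bm25Idf(docCount: number, docsWithTerm: number): number {
  return Math.log(1 + (docCount - docsWithTerm + 0.5) / (docsWithTerm + 0.5));
}

// Textbook BM25 contribution of one term to one document's score.
function bm25TermScore(
  stats: TermStats,
  params: Bm25Params = { k1: 1.2, b: 0.75 },
): number {
  const { tf, fieldLength, avgFieldLength, docCount, docsWithTerm } = stats;
  const lengthNorm = 1 - params.b + params.b * (fieldLength / avgFieldLength);
  const saturatedTf = (tf * (params.k1 + 1)) / (tf + params.k1 * lengthNorm);
  return bm25Idf(docCount, docsWithTerm) * saturatedTf;
}

// Two documents both contain the term once; one field has 2 terms, the other 3.
const shared = { tf: 1, avgFieldLength: 2.5, docCount: 2, docsWithTerm: 2 };
const shortFieldScore = bm25TermScore({ ...shared, fieldLength: 2 });
const longFieldScore = bm25TermScore({ ...shared, fieldLength: 3 });
// shortFieldScore > longFieldScore: the shorter field ranks higher.
```

Raising `k1` lets repeated occurrences of a term keep adding to the score for longer before it saturates; raising `b` penalizes long fields more strongly.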

Adding a text field to an index

addField(fieldName: string, options?: FieldOptions) => void

The first step after creating a document index should be registering all text fields. Document Index indexes only the registered text fields of a document.

Each field can have its own score boosting factor, which is used to boost the score when ranking documents.

Options

interface FieldOptions<D> {
  /**
   * Getter is a function that will be used to get value for this field. If getter function isn't specified, field name
   * will be used to get value.
   */
  getter?: (doc: D) => string;

  /**
   * Score boosting factor.
   */
  boost?: number;
}

Example

const index = new DocumentIndex();

/**
 * Add a "title" field with a score boosting factor "1.4".
 */
index.addField("title", { boost: 1.4 });

/**
 * Add a "description" field.
 */
index.addField("description");

/**
 * Add a "body" field with a custom getter.
 */
index.addField("body", { getter: (doc) => doc.body });
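A getter is mostly useful when the text to index is not a plain string property of the document. The sketch below illustrates the fallback rule described above (use the getter if given, otherwise read the property named after the field). `readField` and `Article` are hypothetical names for illustration, and returning an empty string for missing or non-string values is an assumption of this sketch, not documented library behavior.

```typescript
type FieldGetter<D> = (doc: D) => string;

// Resolve the text of a field: prefer the getter, else read doc[fieldName].
function readField<D extends object>(
  doc: D,
  fieldName: string,
  getter?: FieldGetter<D>,
): string {
  if (getter !== undefined) {
    return getter(doc);
  }
  const value = (doc as Record<string, unknown>)[fieldName];
  // Assumption of this sketch: anything that is not a string indexes as empty text.
  return typeof value === "string" ? value : "";
}

type Article = { id: string; title: string; tags: string[] };

const article: Article = {
  id: "doc1",
  title: "First Document",
  tags: ["search", "index"],
};

// No getter: the field name is used as the property name.
const titleText = readField(article, "title");

// Custom getter: flatten an array of tags into one searchable string.
const tagsText = readField(article, "tags", (doc) => doc.tags.join(" "));
```

The same getter could be registered with `index.addField("tags", { getter: (doc) => doc.tags.join(" ") })`.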

Adding a document to an index

When all fields are registered and the tokenizer and filter are set, Document Index is ready to index documents.

To reduce its memory footprint, Document Index doesn't store documents internally. Every search result contains the docId that was passed when the document was added.

add(docId: I, document: D) => void

Example

const index = new DocumentIndex();

/**
 * Add a "content" field.
 */
index.addField("content");

const doc = {
  "id": "12345",
  "content": "Lorem ipsum",
};

/**
 * Add a document with `doc.id` to an index.
 */
index.add(doc.id, doc);

Removing a document from an index

The remove method needs only the document ID. A removed document is not deleted immediately: it is marked as removed, and searches ignore all documents marked this way. The vacuum() method completely removes the data of removed documents from an index.

remove(docId: I) => void

Example

const index = new DocumentIndex();

/**
 * Add a "content" field.
 */
index.addField("content");

const doc = {
  "id": "12345",
  "content": "Lorem ipsum",
};

/**
 * Add a document with `doc.id` to an index.
 */
index.add(doc.id, doc);

/**
 * Remove a document from an index.
 */
index.remove(doc.id);

Search with a free text query

Performs a search with a free-text query. The query is preprocessed with the registered tokenizer and filter, in the same way as all text fields. Each token separator works as a disjunction operator, so a document matches if it contains any of the query terms. All terms are expanded to find more documents; documents with expanded terms have a lower score than documents with exact terms.

search(query: string) => SearchResult[]

interface SearchResult<I> {
  docId: I;
  score: number;
}

Example

const index = new DocumentIndex();

/**
 * Add a "content" field.
 */
index.addField("content");

const doc1 = {
  "id": "1",
  "content": "Lorem ipsum dolor",
};
const doc2 = {
  "id": "2",
  "content": "Lorem ipsum",
};

/**
 * Add two documents to an index.
 */
index.add(doc1.id, doc1);
index.add(doc2.id, doc2);

/**
 * Perform a search query.
 */
index.search("Lorem");
// => [{ docId: "2", score: ... }, { docId: "1", score: ... }]
//
// The document with id `"2"` is ranked higher because its `"content"` field has fewer terms than that of the
// document with id `"1"`.

index.search("dolor");
// => [{ docId: "1", score: ... }]

Extending a term

Expands a term into all terms registered in an index that start with it.

extendTerm(term: string) => string[]

Example

const index = new DocumentIndex();

index.addField("content");

const doc1 = {
  "id": "1",
  "content": "abc abcde",
};
const doc2 = {
  "id": "2",
  "content": "ab de",
};

index.add(doc1.id, doc1);
index.add(doc2.id, doc2);

/**
 * Extend a term with all possible combinations starting from `"a"`.
 */
index.extendTerm("a");
// => ["ab", "abc", "abcde"]

index.extendTerm("abc");
// => ["abc", "abcde"]

index.extendTerm("de");
// => ["de"]
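The feature list mentions a trie-based inverted index, and prefix expansion is the operation a trie makes cheap. The sketch below is a minimal, self-contained trie with hypothetical names (`TrieNode`, `TermTrie`, `expand`); it illustrates the technique under those assumptions and is not the library's actual data structure.

```typescript
class TrieNode {
  readonly children = new Map<string, TrieNode>();
  isTerm = false; // true when the path from the root spells an indexed term
}

class TermTrie {
  private readonly root = new TrieNode();

  // Register a term, creating one node per character as needed.
  insert(term: string): void {
    let node = this.root;
    for (let i = 0; i < term.length; i++) {
      const ch = term.charAt(i);
      let next = node.children.get(ch);
      if (next === undefined) {
        next = new TrieNode();
        node.children.set(ch, next);
      }
      node = next;
    }
    node.isTerm = true;
  }

  // Return every registered term that starts with `prefix` (depth-first order).
  expand(prefix: string): string[] {
    let node = this.root;
    for (let i = 0; i < prefix.length; i++) {
      const next = node.children.get(prefix.charAt(i));
      if (next === undefined) {
        return []; // no registered term starts with this prefix
      }
      node = next;
    }
    const result: string[] = [];
    const walk = (current: TrieNode, term: string): void => {
      if (current.isTerm) {
        result.push(term);
      }
      current.children.forEach((child, ch) => walk(child, term + ch));
    };
    walk(node, prefix);
    return result;
  }
}

const trie = new TermTrie();
["abc", "abcde", "ab", "de"].forEach((term) => trie.insert(term));

const expandedA = trie.expand("a");     // ["ab", "abc", "abcde"]
const expandedAbc = trie.expand("abc"); // ["abc", "abcde"]
```

Finding the prefix node costs one step per character of the prefix, after which only the matching subtree is visited.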

Converting query to terms

Converts a query to an array of terms using the same tokenizer and filter that the DocumentIndex uses.

Converting queries is useful for implementing a search highlighting feature.

queryToTerms(query: string) => string[]

Example

const index = new DocumentIndex();

index.addField("content");

const doc1 = {
  "id": "1",
  "content": "abc abcde",
};
const doc2 = {
  "id": "2",
  "content": "ab de",
};

index.add(doc1.id, doc1);
index.add(doc2.id, doc2);

/**
 * Convert a query to an array of terms.
 */
index.queryToTerms("a d");
// => ["ab", "abc", "abcde", "de"]
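As a sketch of the highlighting use case, the function below wraps every word of a text whose normalized form appears in a list of terms, such as the array returned by `queryToTerms`. The names `normalizeToken` and `highlight`, the `<mark>` wrapper, and the normalization rule (lower-case, trim non-word characters) are assumptions made for illustration; with a custom tokenizer or filter the normalization would have to match it.

```typescript
// Assumed normalization: lower-case and strip non-word characters at both ends.
const normalizeToken = (token: string): string =>
  token.toLowerCase().replace(/^\W+/, "").replace(/\W+$/, "");

// Wrap each whitespace-separated word whose normalized form is in `terms`.
function highlight(text: string, terms: readonly string[]): string {
  const termSet = new Set(terms);
  return text
    // The capturing group keeps the whitespace separators in the result,
    // so the original spacing survives the split/join round trip.
    .split(/(\s+)/)
    .map((piece) =>
      termSet.has(normalizeToken(piece)) ? `<mark>${piece}</mark>` : piece,
    )
    .join("");
}

const highlighted = highlight("Abc, xyz  de", ["ab", "abc", "abcde", "de"]);
// highlighted === "<mark>Abc,</mark> xyz  <mark>de</mark>"
```

The sketch does not escape HTML in the input text, so real use would need escaping before the result is inserted into a page.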

Serializing / deserializing the Index

For compatibility with browsers, the serialize and deserialize functions work with Uint8Array values.

Default indexing options

const {writeFileSync, readFileSync} = require("fs");
const {serialize, deserialize, DocumentIndex} = require("ndx");

const index = new DocumentIndex();

index.addField("content");

const doc1 = {
  "id": "1",
  "content": "abc abcde",
};
const doc2 = {
  "id": "2",
  "content": "ab de",
};

index.add(doc1.id, doc1);
index.add(doc2.id, doc2);

const dumpFile = "./dump.msp";

// "serialize" call returns Uint8Array value
writeFileSync(dumpFile, serialize(index));

// "deserialize" call takes Uint8Array argument
const deserializedIndex = deserialize(readFileSync(dumpFile));
const query = "a d";

// prints "true"
console.log(
  JSON.stringify(index.search(query))
  ===
  JSON.stringify(deserializedIndex.search(query))
);

// prints "true"
console.log(
  JSON.stringify(index.queryToTerms(query))
  ===
  JSON.stringify(deserializedIndex.queryToTerms(query))
);

Custom indexing options

The library doesn't serialize functions. If you use a custom tokenizer, filter, or field getter, you need to pass these functions to the deserialize call. See the example in the src/tests/serialization.spec.ts test file.

Vacuuming

Vacuuming is a process that will remove all outdated documents from an inverted index.

When search is performed, all outdated documents will be automatically removed for all terms generated from a search query.

vacuum() => void

Example

const index = new DocumentIndex();

index.addField("content");

const doc = {
  "id": "12345",
  "content": "Lorem ipsum",
};

index.add(doc.id, doc);
index.remove(doc.id);

/**
 * Perform vacuuming: all documents marked as removed are deleted from the Inverted Index.
 */
index.vacuum();
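The mark-then-vacuum approach can be pictured with a toy inverted index: removal only records the ID in a set, searches filter against that set, and vacuuming rewrites the posting lists. `TinyInvertedIndex` and its members are hypothetical and much simpler than the library's trie-based structure; unlike the behavior described above, this sketch does not clean up lazily during searches.

```typescript
class TinyInvertedIndex {
  private readonly postings = new Map<string, string[]>(); // term -> doc IDs
  private readonly removed = new Set<string>();            // tombstones

  add(docId: string, terms: readonly string[]): void {
    terms.forEach((term) => {
      const ids = this.postings.get(term);
      if (ids === undefined) {
        this.postings.set(term, [docId]);
      } else {
        ids.push(docId);
      }
    });
  }

  // Cheap removal: only remember that the document is gone.
  remove(docId: string): void {
    this.removed.add(docId);
  }

  // Searches skip documents that are marked as removed.
  search(term: string): string[] {
    const ids = this.postings.get(term);
    return ids === undefined ? [] : ids.filter((id) => !this.removed.has(id));
  }

  // Physically drop removed documents from every posting list.
  vacuum(): void {
    this.postings.forEach((ids, term) => {
      const live = ids.filter((id) => !this.removed.has(id));
      if (live.length === 0) {
        this.postings.delete(term);
      } else {
        this.postings.set(term, live);
      }
    });
    this.removed.clear();
  }

  // Total number of stored (term, docId) entries, including stale ones.
  postingCount(): number {
    let count = 0;
    this.postings.forEach((ids) => {
      count += ids.length;
    });
    return count;
  }
}
```

The trade-off is that `remove` stays constant-time while stale entries keep occupying memory until `vacuum` runs.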

Useful packages

Text Processing

  • stemr is an optimized implementation of the Snowball English (porter2) stemmer algorithm.
  • Natural is a general natural language facility for nodejs. Tokenizing, stemming, classification, phonetics, tf-idf, WordNet, string similarity, and some inflections are currently supported.
  • stopword is a node module that allows you to strip stopwords from an input text.

Alternative Javascript Full-Text Search Engines

Install

npm i @vladimiry/ndx

License

MIT