@chcaa/text-search-lite

0.15.0 • Public • Published

Text Search Lite

A full-text search engine with support for phrase, prefix, and fuzzy searches using the bm25f scoring algorithm. A built-in mini query language is provided for advanced search features, as well as a programmatic query interface. Aggregations and filters are supported as well.

Installation

npm install @chcaa/text-search-lite

Troubleshooting

If the build fails because of the node-snowball package, try installing:

sudo apt-get install -y build-essential

Getting Started

Any POJO with an id property (>= 1) can be indexed by text-search-lite. Documents (objects) are indexed in a SearchIndex instance which provides the main interface for adding, updating, deleting, and searching the documents of the index. When creating a new SearchIndex, the fields to search, sort, filter, or aggregate on must be defined in a schema definition for the SearchIndex to handle them correctly. Each field must define which type it should be indexed/stored as, which determines how the values of the field will be processed for searches, filtering, and aggregations. Additional options can be configured, depending on the type of the field, to further control what should be indexed and stored and how values should be processed. This is discussed in detail in the Document Schema chapter.

import { SearchIndex } from '@chcaa/text-search-lite';

let persons = [
  { id: 1, name: 'Jane', gender: 'female', age: 54, hobbies: ['Cycling', 'Swimming'] },
  { id: 2, name: 'John', gender: 'male', age: 34, hobbies: ['Swimming'] },
  { id: 3, name: 'Rose', gender: 'female', age: 37, hobbies: [] }
];

let personsIndex = new SearchIndex([
  { name: 'name', type: SearchIndex.fieldType.TEXT },
  { name: 'gender', type: SearchIndex.fieldType.KEYWORD },
  { name: 'age', type: SearchIndex.fieldType.NUMBER },
  { name: 'hobbies', type: SearchIndex.fieldType.TAG, array: true }
]);

personsIndex.addAll(persons);

Once the search index is created and has some documents added, it can be searched using the search() method. The query to search for can be expressed either in a string-based query-string language or as a combination of different query objects. The query-string language is used in the following examples.

// find all persons named "Jane", case does not matter
let janes = personsIndex.search('jane');

// find all female persons who can swim, "+" means the term must be present
let femalesWhoCanSwim = personsIndex.search('+female +swimming');

// narrow the search to only target specific fields
let femalesWhoCanSwimPrecise = personsIndex.search('+gender:(female) +hobbies:(swimming)');

// prefix search, wildcard single character, fuzzy search
let proximitySearch = personsIndex.search('J* 3? cyclist~');

The result of the queries will include an array of matching results with the id of the document and the relevance score of the document in relation to the query. If there are more than 10 results, only the first 10 results will be included (this can be controlled using the pagination option).

{
  results: [
    {
      id: 1,
      score: 0.4458314786416938
    }
  ],
  sorting: {
    field: "_score",
    order: "desc"
  },
  pagination: {
    offset: 0,
    limit: 10,
    total: 1
  },
  query: {
    queryString: "jane",
    errors: []
  }
}

To include the source object and/or a highlighted version of the source object, the highlight and includeSource query options can be set. For includeSource or highlight to be able to resolve the source objects, source.store must be enabled (the default) or the idToSourceResolver function must be configured in the query options.

let persons = [
  { id: 1, name: 'Jane', gender: 'female', age: 54, hobbies: ['Cycling', 'Swimming'] },
  { id: 2, name: 'John', gender: 'male', age: 34, hobbies: ['Swimming'] },
  { id: 3, name: 'Rose', gender: 'female', age: 37, hobbies: [] }
];
let personsById = new Map(); // this could be from a db/repository
persons.forEach(p => personsById.set(p.id, p));

// find all persons named "Jane" and highlight them
let janes = personsIndex.search('jane', {
  highlight: { enabled: true },
  // idToSourceResolver is only needed if source.store is disabled; with the default
  // settings the stored source objects are used directly
  idToSourceResolver: ids => ids.map(id => personsById.get(id))
});

For each document, the result will include a highlight.source property where the terms matching the search are enclosed in HTML <mark> elements.

{
  results: [
    {
      id: 1,
      score: 0.4458314786416938,
      highlight: {
        source: {
          id: 1,
          name: "<mark>Jane</mark>",
          gender: "female",
          age: 54,
          hobbies: ["Cycling", "Swimming"]
        }
      }
    }
  ],
  // ...
}

Aggregations for all non-text fields can be collected using the aggregations part of the queryOptions (see the Aggregations chapter for details).

import { rangeAggregationWithIntegerAutoBuckets, termAggregation } from "@chcaa/text-search-lite/aggregation";
// get aggregations about all (empty string = match all) documents' gender and hobbies
let all = personsIndex.search('', {
  aggregations: [
    termAggregation('gender'),
    termAggregation('hobbies'),
    rangeAggregationWithIntegerAutoBuckets('age', 5, 0, 100),
  ]
});

When aggregations are requested, the result includes an aggregations array with one entry per requested aggregation. Term aggregations are sorted by docCount:DESC, term:ASC.

{
  results: [/*... */],
  aggregations: [
    {
      name: 'gender',
      fieldName: 'gender',
      type: "term",
      fieldType: "keyword",
      buckets: [
        { key: 'female', docCount: 2 },
        { key: 'male', docCount: 1 }
      ],
      missingDocCount: 0
    },
    {
      name: 'hobbies',
      fieldName: 'hobbies',
      type: "term",
      fieldType: "tag",
      buckets: [
        { key: 'swimming', docCount: 2 },
        { key: 'cycling', docCount: 1 }
      ],
      missingDocCount: 1 // person with id=3 does not have any hobbies
    },
    {
      name: 'age',
      fieldName: 'age',
      type: "range",
      fieldType: "number",
      buckets: [
        { key: '0-20', from: 0, to: 20, docCount: 0 },
        { key: '20-40', from: 20, to: 40, docCount: 2 },
        { key: '40-60', from: 40, to: 60, docCount: 1 },
        { key: '60-80', from: 60, to: 80, docCount: 0 },
        { key: '80-100', from: 80, to: 100, docCount: 0 }
      ],
      missingDocCount: 0
    }
  ]
}

The aggregations are only collected for the documents matching the search query and filters (if applied), so if we search for "jane" we only get aggregations for the documents matching this query.

// get aggregations about the documents matching the query
let all = personsIndex.search('jane', {
  aggregations: [
    termAggregation('gender'),
    termAggregation('hobbies'),
    rangeAggregationWithIntegerAutoBuckets('age', 5, 0, 100),
  ]
});
{
  results: [/*... */],
  aggregations: [
    {
      name: 'gender',
      fieldName: 'gender',
      type: "term",
      fieldType: "keyword",
      buckets: [
        { key: 'female', docCount: 1 }
      ],
      missingDocCount: 0
    },
    {
      name: 'hobbies',
      fieldName: 'hobbies',
      type: "term",
      fieldType: "tag",
      buckets: [
        { key: 'cycling', docCount: 1 },
        { key: 'swimming', docCount: 1 }
      ],
      missingDocCount: 0
    },
    {
      name: 'age',
      fieldName: 'age',
      type: "range",
      fieldType: "number",
      buckets: [
        { key: '0-20', from: 0, to: 20, docCount: 0 },
        { key: '20-40', from: 20, to: 40, docCount: 0 },
        { key: '40-60', from: 40, to: 60, docCount: 1 },
        { key: '60-80', from: 60, to: 80, docCount: 0 },
        { key: '80-100', from: 80, to: 100, docCount: 0 }
      ],
      missingDocCount: 0
    }
  ]
}

Filters, e.g., coming from user-selected facets (created from the aggregations), can be applied using the filter part of the queryOptions. Multiple filters must be combined into a single composite filter using a BooleanFilter, which determines how the results of each filter should be combined. Filters can be nested using BooleanFilters in as many levels as needed.

import { greaterThanOrEqualFilter } from "@chcaa/text-search-lite/filter";
// get all persons with age >= 35
let all = personsIndex.search('', {
  filter: greaterThanOrEqualFilter('age', 35)
});

As we did a match-all query (empty string) and only narrowed the results using a filter, the score for all documents is 0, as filters do not score results, only queries do. Because of this, a match-all query also changes the default sorting to id instead of _score.

{
  results: [
    { id: 1, score: 0 },
    { id: 3, score: 0 }
  ],
  sorting: { field: 'id', order: 'asc' },
  pagination: { offset: 0, limit: 10, total: 2 },
  query: { queryString: '' }
}

The two filters below are combined using AND logic, meaning that a document must pass both filters to be included.

import { andFilter, greaterThanOrEqualFilter, termFilter } from "@chcaa/text-search-lite/filter";
// get all persons with age >= 35 who can swim
let all = personsIndex.search('', {
  filter: andFilter([
    greaterThanOrEqualFilter('age', 35),
    termFilter('hobbies', 'swimming')
  ])
});

Only a single person ("Jane") matches the filters.

{
  results: [
    { id: 1, score: 0 }
  ],
  // ...
}

Document Schema

Each document field to index for searching and/or to use for filtering, sorting, and aggregations must be defined as part of the document schema for the SearchIndex. A field is defined as an object which must always have a name and a type and, depending on the type, can have a set of additional properties to further specify how the values of the field should be processed and stored.

The fields are passed to the SearchIndex as an array including all the fields the SearchIndex should know about. Additionally, a schema options object for advanced configuration of the search index can be passed as a second argument.

let personsIndex = new SearchIndex([
  { name: 'name', type: SearchIndex.fieldType.TEXT, },
  { name: 'gender', type: SearchIndex.fieldType.KEYWORD, index: false },
  { name: 'age', type: SearchIndex.fieldType.NUMBER, index: true },
  { name: 'hobbies', type: SearchIndex.fieldType.TAG, array: true, docValues: false }
], {
  schema: { // general config of the schema
    analyzer: SearchIndex.analyzer.LANGUAGE_ENGLISH,
    score: { // ADVANCED change the default settings of scoring algorithm
      k1: 1.5,
    }
  }
});

Schema options

The schema part of the options-object can be used to change the default general settings of the schema. All properties are optional, so the argument can be left out if no change is required.

  • analyzer: string - The name of the default analyzer to use for text fields. text-search-lite has a set of common built-in analyzers for different languages which are accessible through SearchIndex.analyzer, or a custom analyzer can be installed and used. Defaults to 'standard'.
  • sorting: object - Sorting config.
    • locale: string - The default Unicode locale to use for sorting text-like fields. The language part is required, the region is optional. Defaults to en-US.
  • score: object - Scoring config.
    • k1: number - The k1 parameter of the bm25f scoring algorithm. This is the default value used for all analyzers that do not have k1 assigned specifically. Defaults to 1.2.
    • analyzerK1: object - k1 for individual analyzers. Register k1 for a specific analyzer by using the name of the analyzer as the property name and the k1 value as the value.
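
As a sketch, the schema options above could be combined as follows (the analyzer name used as the analyzerK1 key is only illustrative):

import { SearchIndex } from '@chcaa/text-search-lite';

let index = new SearchIndex([/* field definitions */], {
  schema: {
    analyzer: SearchIndex.analyzer.LANGUAGE_ENGLISH, // default analyzer for text fields
    sorting: { locale: 'en-GB' },                    // default sorting locale for text-like fields
    score: {
      k1: 1.2,                                       // default k1 for all analyzers
      analyzerK1: { standard: 1.5 }                  // k1 override for the 'standard' analyzer
    }
  }
});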

Field settings

The following properties are available for all field types:

  • name: string - The path name of the field. E.g. "author.name" (for arrays of values the brackets should be excluded, e.g. "authors.name").
  • type: ('text'|'keyword'|'tag'|'number'|'date'|'boolean') - The type of the field.
  • index?: boolean - Set to true if the field should be searchable.
  • docValues?: boolean - Set to true if the field should be available in filters or be used for aggregations.
  • array?: boolean - Set to true if the property is an array or if it's a descendant of an array. Defaults to false.
  • boost?: number - The relevance of the field when scoring it in a search. Must be >= 1. Defaults to 1.
  • prefix?: object - Prefix config. Only relevant if index=true.
    • eagerLoad?: boolean - Set to true if prefix mappings should be eagerly loaded. If false, prefixes will first be loaded when the field is queried. Defaults to false.
    • partitionDepth?: number - The maximum partition depth of the prefix tree. In most cases the default is fine; only in cases where, e.g., all analyzed terms start with the same prefix, such as "000-SomeValue", should this be set to a higher number. Defaults to 3.
  • fuzzy?: object - Fuzzy config. Only relevant if index=true.
    • enabled?: boolean - true if fuzzy queries should be supported. Defaults to true.
  • score?: object - The scoring parameters to use when calculating the score of the field. Only relevant if index=true.
    • b?: number - The hyperparameter b of bm25f. Defaults to 0.75.
  • docStats?: object - Document statistics config. Only relevant if index=true.
    • length?: boolean - Should doc field length be stored. Defaults to false.
    • termFrequencies?: boolean - Should doc term frequencies be stored. Defaults to false.
    • termPositions?: boolean - Should doc term positions be stored. Only relevant for text fields. This is required for doing phrase searches. Defaults to false.
  • sorting: boolean|object - Primarily used for enabling sorting on fields of type text, as all other field types are sortable if they have docValues=true.
    • locale: string - The Unicode locale to use for sorting text-like fields. Overrides the default setting taken from the schema options.
    • transform: function - A custom transform function to convert the value to the value to sort on. Must return the same type as the input type.
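
As a sketch, a field definition using several of the settings above could look like this (the field names and values are illustrative):

let articlesIndex = new SearchIndex([
  {
    name: 'title',
    type: SearchIndex.fieldType.TEXT,
    boost: 2,                                     // matches in the title weigh more when scoring
    prefix: { eagerLoad: true },                  // load prefix mappings up front
    score: { b: 0.5 },                            // less length normalization for short titles
    sorting: { transform: v => v.toLowerCase() }  // sort on the lowercased title
  },
  { name: 'authors.name', type: SearchIndex.fieldType.KEYWORD, array: true }
]);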

Some properties are only available for specific field types or have another value than the default. The additional fields and different default values are described for each field below.

A note on docStats

Even though docStats can be enabled for all field types, it only makes sense for text fields, as all other field types are not tokenized. The only exception is if a field has array=true and the number of elements in the array should be taken into account when scoring the document.
E.g., if we have documents with a hobbies array where doc-1 has ["bicycling"] and doc-2 has ["bicycling", "climbing"], and documents with more hobbies should score lower than documents with fewer hobbies, then docStats.length could be set to true, as the number of hobbies will then be used when calculating how relevant the match is.
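
A sketch of the hobbies example above, where the number of values in the array is taken into account when scoring:

let index = new SearchIndex([
  // fewer hobbies means a match in the field counts relatively more
  { name: 'hobbies', type: SearchIndex.fieldType.TAG, array: true, docStats: { length: true } }
]);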

text Fields

A text field is the primary field to use for full-text searches. A text field is analyzed (normalized and tokenized) when indexed which makes it ideal for efficient lookup of terms and phrases in the text of the field.

docValues not allowed
A text field cannot have docValues=true because it is tokenized. Therefore, a text field cannot be used for filtering, aggregations, or sorting.

Default settings override

  • index: true
  • docStats
    • length: true
    • termFrequencies: true
    • termPositions: true

Text field-specific settings

  • indexExact?: boolean - Set to true for text fields to enable phrase searches and more precise matching. Defaults to true.
  • analyzer?: string - The name of the analyzer to use for this field. Defaults to undefined which resolves to the default analyzer configured for the search index.

Sorting

If sorting=true the first 50 characters of the original text will be lowercased and used for sorting. To change the default behavior, a custom transform function can be supplied instead, e.g. sorting: { transform: v => v.substring(0, 20).toUpperCase() }.

Sorting and memory usage
Enabling sorting for text fields creates a new hidden index which will consume extra memory, so only enable sorting for text fields where it is actually needed.

keyword Fields

A keyword field is indexed as is without applying any form of analysis. To match the value of a keyword field the same string as when the field was indexed must be used.

Default settings override

  • index: true
  • docValues: true

tag Fields

A tag field is indexed in the same way as a keyword field except that lowercase is applied making the value of the field case-insensitive.

Default settings override

  • index: true
  • docValues: true

number Fields

A number field is used to store numeric values such as age, weight, length, and other measures and is typically used for filtering and sorting but can be indexed as well (disabled by default).

Default settings override

  • index: false
  • docValues: true
  • fuzzy
    • enabled: false

number fields should in most cases only be indexed if the vocabulary is relatively small and made up of integers. A large vocabulary, caused either by floating point numbers or large-scale integers, will be hard to match in a search and would furthermore result in a large inverted index.

date Fields

A date field is used to store date and date-time values and is typically used for filtering and sorting but can be indexed as well (disabled by default).

Default settings override

  • index: false
  • docValues: true
  • fuzzy
    • enabled: false

Date field-specific settings

  • format: string - A format string in one of the formats yyyy, yyyy-MM-dd or yyyy-MM-dd'T'HH-mm-ssZ.

Document value types

The value of the document field can express a date in one of the following ways:

  • number - An integer in epoch millis. Negative values are allowed.
  • string - A date string if field.format is defined.

Dates in BC time can, for all string formats, be expressed as negative years with six digits: -yyyyyy, e.g. -000001-01-01.

When format is defined, both epoch millis and date strings in the defined format are allowed as values for the field. If docValues is enabled for the field, the date will be converted to the epoch millis version before being stored, which will then be used for filtering, aggregations, and sorting. If index is enabled for the field, the date will be converted to the defined string format before being indexed, so the date can be searched for using the given format.
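
As a sketch, a date field with a format and documents using both value types could look like this (the born field is illustrative, and SearchIndex.fieldType.DATE is assumed to mirror the other field type constants):

let index = new SearchIndex([
  { name: 'born', type: SearchIndex.fieldType.DATE, format: 'yyyy-MM-dd', index: true }
]);

index.addAll([
  { id: 1, born: '1990-05-01' }, // date string in the defined format
  { id: 2, born: 0 }             // epoch millis, i.e. 1970-01-01
]);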

A date field can only be indexed if the format precision is set to yyyy or yyyy-MM-dd. A large vocabulary caused by minute, second, or even millisecond precision will be hard to match in a search and would furthermore result in a large inverted index.

Regarding Time Zones
Dates will internally always be stored as UTC. If date inputs include time using the yyyy-MM-dd'T'HH-mm-ssZ format and no time zone is present, the date will be parsed as UTC.

boolean Fields

A boolean field is used to store the boolean values true|false and is typically used for filtering and sorting but can be indexed as well (disabled by default).

Default settings override

  • index: false
  • docValues: true
  • fuzzy
    • enabled: false

docId Fields

A special field type used for storing the id of a document in an optimized way. The field type cannot be configured on user-defined fields but can still be encountered as the id field is publicly available.

Create, Update and Delete Documents

Creating, updating, and deleting documents can be done using the following methods.

All documents must have an id property with an integer value > 0.

Method: add(document)

Adds a document to the index. If the document already exists, an error will be thrown.

Parameters:

  • document: object - The document to add.

Method: addAll(documents)

Adds all documents to the index. If one of the documents already exists, an error will be thrown.

This method is performance optimized for adding many documents at once.

Parameters:

  • documents: object[] - The documents to add.

Method: update(document)

Updates an existing document. If the document does not exist, the document will be added.

Parameters:

  • document: object - The document to update.

Method: remove(document)

Removes the document from the index.

Parameters:

  • document: object - The document to remove.

Method: removeById(id)

Removes the document with the id from the index.

Parameters:

  • id: number - The id of the document to remove.
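
A short sketch of the methods above, using the persons index from Getting Started:

personsIndex.add({ id: 4, name: 'Liz', gender: 'female', age: 29, hobbies: ['Running'] });

// update replaces the existing document (or adds it if it does not exist)
personsIndex.update({ id: 4, name: 'Liz', gender: 'female', age: 30, hobbies: ['Running'] });

// remove by id (or pass the document itself to remove())
personsIndex.removeById(4);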

Searching

Searching the index is done using the search() method. The query part of the search can be expressed in the built-in query string language or as a combination of query objects. Furthermore, filters, aggregations, sorting, and pagination can be applied/requested through the optional queryOptions object which can be passed as a second argument to search().

Method: search(query, [queryOptions])

Searches the index.

Parameters:

  • query: string|Query|Query[] - The query to search for.
  • queryOptions?: object - Query options.
    • fields?: string[] - The names of the fields to search. Defaults to all user-created indexed fields if not defined.
    • pagination?: object - The pagination to apply.
      • offset?: number - The pagination offset. Defaults to 0.
      • limit?: number - The pagination limit. Defaults to 10.
    • sorting?: object - The sorting to apply.
      • field?: string - The field to sort by or "_score".
      • order?: ('asc'|'desc') - The sorting order.
    • filter?: Filter - The filter to apply.
    • aggregations?: Aggregation[] - The aggregations to generate.
    • highlight?: boolean|object - Highlight options. Defaults to false.
      • enabled: boolean - Should highlight be enabled.
      • prefix?: object - Prefix highlight options.
        • expand?: boolean - Set to true if the matched term should be fully highlighted and false if only the prefix part should.
    • includeSource?: boolean - Should the source object be included in the result. Defaults to false.
    • idToSourceResolver?: function(number[]):{id:number}[] - A function for resolving source objects from an array of ids. If configured, this function will be used to resolve source objects for the query instead of the source objects originally indexed.
    • queryString?: object - Query string options.
      • parseOptions: object - Query string parse options. Enable/disable which query-string expressions to parse.
      • defaultOccurrence: ('should'|'must'|'mustNot') - The default occurrence to use when no occurrence modifier is set. Defaults to 'should'.

See also the "search index options" section of the search index configuration chapter for configuring custom default query options.

Returns:

  • object - The result of the search.
    • results: object[] - Information about each document matching the search and applied filters and pagination.
      • results[].id: number - The id of the document.
      • results[].score: number - The relevance score of the document. (This will be 0 when sorting on something other than _score or when a match-all query is performed).
      • results[].source: object - The source object if requested in the queryOptions.
      • results[].highlight: object - Highlight information if requested in the queryOptions.
      • results[].highlight.source: object - A highlighted version of the source object.
    • sorting: object - The sorting applied to the result.
    • pagination: object - The pagination applied to the result and the total number of matches.
      • offset: number
      • limit: number
      • total: number - The total match count.
    • aggregations: object - The aggregation results. (See aggregations for the different result object structures).
    • query: object - The query-string and possible errors. This is only available if the query was performed using a query-string.
      • queryString: string - The query-string used for the search.
      • errors: object[] - The parse errors, if any, which occurred during parsing of the query-string.
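
As a sketch, several of the query options described above could be combined like this (using the persons index from Getting Started):

let result = personsIndex.search('swimming', {
  fields: ['hobbies'],                      // only search the hobbies field
  pagination: { offset: 0, limit: 5 },
  sorting: { field: 'age', order: 'desc' }, // sort by age instead of relevance
  includeSource: true                       // include the original documents in the result
});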

Query String Language

Text-search-lite has a built-in query-string mini language for expressing text-based queries with support for expressing the same types of queries as the programmatic API does, such as boolean modifiers, phrases, wildcards, targeting specific fields, and grouping of statements. The query-string parser automatically converts any unparsable part of the query to regular "text" making it safe to expose the query-string language directly to the end-user.

The following modifiers and expressions are supported and can be turned on/off individually to limit what should be parsed and what should just be treated as regular text.

Phrases "A phrase"

A phrase is one or more terms in a specific form, which should be present in a particular order.

search for "a full sentence" or for a "single" specific spelling of a term

Must Operator +

The term, phrase, or group content must be present in the document for it to match.

+peace in the +world

Must Not Operator -

The term, phrase, or group content must not be present in the document for it to match.

peace not -war

Boost Operator ^NUMBER

Boost the relevance of the term, phrase, or the content of a group.

peace^10 "love not war"^2

Prefix Operator *

The term must start with one or more characters, but the ending is undetermined. Prefix queries take the difference in length between the match and the prefix string into account when scoring is calculated.

love and pea*

Wildcard Operator ?, *

The term can have single and multiple character spans which are undetermined. The single character wildcard is expressed by ? and the multiple character wildcard is expressed by *. Wildcard queries take the difference in length between the match and the wildcard-term into account when scoring is calculated.

love and p?a*e

Fuzzy Operator ~, ~[0, 1, 2]

The term must match other terms within a maximum edit distance. When the edit distance is not defined specifically as one of [0, 1, 2], the edit distance is calculated based on the length of the term.

  • length < 3: maxEdits = 0
  • length < 6: maxEdits = 1
  • length >= 6: maxEdits = 2

Fuzzy queries take the edit distance between the term and the result into account when scoring is calculated.

love~ and peace~2

Groups ()

Terms and phrases can be grouped together, and boolean operators and boost can be applied to a group, making it possible to express more complex queries.

+peace +(world earth) (love solidarity)^10

Field Groups FIELD1:FIELD2:()

Field groups offer the same possibilities as groups and additionally target one or more fields where the match must occur. Multiple fields must be separated by a colon (:).

Field Groups cannot be nested.

title:(world earth) title:description:(love solidarity)

Query String Options

The parsing of the query-string language can be configured in the queryOptions of search() and parseQueryStringToQueryObjects(), where each language feature can be enabled/disabled. All features are enabled by default.

  • queryString: object - The Query string options.
    • parseOptions: object - Enable/disable which query-string expressions to parse.
      • quote: boolean - Toggle parsing of "exact strings and phrases".
      • group: boolean - Toggle parsing of (terms in group).
      • fieldGroup: boolean - Toggle parsing of title:(terms in field group).
      • mustOperator: boolean - Toggle parsing of +mustOperator.
      • mustNotOperator: boolean - Toggle parsing of -mustNotOperator.
      • prefixOperator: boolean - Toggle parsing of prefix*.
      • wildcardOperator: boolean - Toggle parsing of wil_c*d.
      • fuzzyOperator: boolean - Toggle parsing of fuzzy~1.
      • boostOperator: boolean - Toggle parsing of boost^10.
    • defaultOccurrence: ('should'|'must'|'mustNot') - The default occurrence to use when no occurrence modifier is set. Defaults to 'should'.
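
A sketch disabling some of the expressions above while changing the default occurrence (using the persons index from Getting Started):

let result = personsIndex.search('wild* fuzzy~ "a phrase"', {
  queryString: {
    parseOptions: {
      prefixOperator: false,   // treat "wild*" as plain text
      wildcardOperator: false,
      fuzzyOperator: false     // treat "fuzzy~" as plain text
    },
    defaultOccurrence: 'must'  // terms without a modifier must be present
  }
});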

Parse Errors

When using the query-string language, the search() and parseQueryStringToQueryObjects() methods include information about any parse errors and their exact location in the string. The parse errors are structured in the following format.

  • errors: object[] - An array of error objects.
    • errors[].type: string - The type of the error.
    • errors[].message: string - A user-friendly message.
    • errors[].startIndex: number - The start index in the source string where the reported error occurs.
    • errors[].spanSize: number - The character span of the reported error.

The query-string language can also be validated directly using the validateQueryString() method, which could, e.g., be used for user feedback while typing a query.

Method: validateQueryString(queryString, [parseOptions])

Validates the query string. Any problems with the query string will be reported in the errors array of the returned object.

Parameters:

  • queryString: string - The query string to validate.
  • parseOptions: object - Options for configuring which parts of the query string language should be enabled. (See Query String Options).

Returns:

  • object - The result of the validation.
    • status: ('success'|'error') - The status of the validation.
    • errors: object[] - The parse errors which occurred during parsing. (See above).
    • queryString: string - The query-string which was validated.
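
A sketch of validating user input while typing, assuming validateQueryString() is called on the SearchIndex instance in the same way as search():

let validation = personsIndex.validateQueryString('title:(unclosed group');
if (validation.status === 'error') {
  for (let error of validation.errors) {
    // report the position of each problem back to the user
    console.log(`${error.message} (at index ${error.startIndex}, span ${error.spanSize})`);
  }
}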

Query Objects

The query-language described in the previous chapter is converted into a combination of query objects which can also be created programmatically.

A single query object or an array of query objects can be passed as the query to the search method of a SearchIndex instance.

import { prefixQuery, termQuery, fieldGroupQuery, Query } from "@chcaa/text-search-lite/query";
// find all persons with a term starting with "jo"
let query = prefixQuery('jo');
let startingWithJo = personsIndex.search(query);

let query2 = [termQuery('cycling'), termQuery('climbing')];
let withOneOfHobbies = personsIndex.search(query2);

let query3 = [termQuery('cycling', Query.occurrence.MUST), termQuery('climbing', Query.occurrence.MUST)];
let withBothHobbies = personsIndex.search(query3);

let query4 = fieldGroupQuery(['hobbies'], [termQuery('cycling'), termQuery('climbing')]);
let withOneOfHobbiesInField = personsIndex.search(query4);

To convert from the query-string language to the equivalent query objects the SearchIndex exposes the method parseQueryStringToQueryObjects() making it possible to express the initial part of a query in the query-string language and then further modify the query (add, replace etc.) using query objects.

Factory Functions

Factory functions for creating the different kinds of query objects are exported from the @chcaa/text-search-lite/query package along with the query classes the factory functions produce. The factory functions are the suggested way of creating queries, while the classes can be used for type definitions.

Function: termQuery(term, [occurrence], [boost])

Creates a new TermQuery for matching terms/tokens in a document.

The term (text) for the query will be analyzed using the analyzer of the field before performing the query. E.g., If the field to search is a tag the term will be lower-cased, if the field is a text field, the term will be normalized and tokenized.

When searching text fields, all the tokens in the term need to be present in a document for the query to match. A search for the term "They went for a walk" will be analyzed to something like ['they', 'went', 'for', 'a', 'walk'], which will all be matched against the field of each document, and only documents containing all the tokens in the field will be included. To be able to search for documents containing only one or some of the tokens, the term should be split into smaller queries, typically by splitting on whitespace.

Term queries can also be used on non-text fields such as number, boolean and date and can be passed number and boolean type values. If the passed in value is a supported non-text type, it will be transformed to the correct indexed version of the value before querying, e.g., (boolean: true → "true"), (number: 1000 → "1000"), (date: 0 → "1970-01-01").

Parameters:

  • term: string|number|boolean - The term to search for.
  • occurrence?: ("should"|"must"|"mustNot") - The occurrence of the term. Defaults to "should".
  • boost?: number - The boost to multiply the score of the term with when scoring the matching documents. Defaults to 1.

returns:

  • TermQuery

Function: phraseQuery(phrase, [occurrence], [boost])

Creates a new PhraseQuery for matching documents containing a phrase.

Parameters:

  • phrase: string - The phrase to search for.
  • occurrence?: ("should"|"must"|"mustNot") - The occurrence of the term. Defaults to "should".
  • boost?: number - The boost to multiply the score of the term with when scoring the matching documents. Defaults to 1.

returns:

  • PhraseQuery

Function: prefixQuery(term, [occurrence], [boost])

Creates a new PrefixQuery for matching documents with terms starting with a prefix.

Parameters:

  • term: string - The prefix-term the matched terms should start with.
  • occurrence?: ("should"|"must"|"mustNot") - The occurrence of the term. Defaults to "should".
  • boost?: number - The boost to multiply the score of the term with when scoring the matching documents. Defaults to 1.

returns:

  • PrefixQuery

Function: wildcardQuery(term, [occurrence], [boost])

Creates a new WildcardQuery for matching documents on a term with wildcards.

  • ? matches a single character.
  • * matches 0-n characters.

The wildcard-term cannot start with a wildcard.

Parameters:

  • term: string - The wildcard-term the matched terms should match.
  • occurrence?: ("should"|"must"|"mustNot") - The occurrence of the term. Defaults to "should".
  • boost?: number - The boost to multiply the score of the term with when scoring the matching documents. Defaults to 1.

returns:

  • WildcardQuery

Function: fuzzyQuery(term, [maxEdits], [occurrence], [boost], [maxTopTermExpansionsPerField])

Creates a new FuzzyQuery for matching documents matching an expanded (fuzzy) term.

A fuzzy query expands the term up to a maximum of 2 edit distances based on the Levenshtein edit distance and uses the expanded terms to perform an OR query on the fields to search.

If the limit of maxTopTermExpansionsPerField is exceeded, the top terms will be selected based on simple idf relevance (docCount/df) and edit distance.

The auto maxEdit distance (the default) has the following values:

  • length < 3: maxEdits = 0
  • length < 6: maxEdits = 1
  • length >= 6: maxEdits = 2

Parameters:

  • term: string - The term to search for variants of.
  • maxEdits?: (-1|0|1|2) - The maximum edits allowed. Auto (-1) determines the max edits based on the length of the initial term. Defaults to -1.
  • occurrence?: ("should"|"must"|"mustNot") - The occurrence of the term. Defaults to "should".
  • boost?: number - The boost to multiply the score of the term with when scoring the matching documents. Defaults to 1.
  • maxTopTermExpansionsPerField?: number - The maximum expansions to include per field. Defaults to 50.

returns:

  • FuzzyQuery

Function: groupQuery(children, [occurrence], [boost])

Creates a new GroupQuery for matching documents fulfilling a group of queries.

Parameters:

  • children: Query[] - The queries this query should combine based on their occurrence.
  • occurrence?: ("should"|"must"|"mustNot") - The occurrence requirement of the group (if e.g., included in a parent group). Defaults to "should".
  • boost?: number - The boost to multiply the score of the children of the group with when scoring the matching documents. Defaults to 1.

returns:

  • GroupQuery

Function: fieldGroupQuery(fieldNames, children, [occurrence], [boost])

Creates a new FieldGroupQuery for matching documents fulfilling a group of queries across one or more fields.

Parameters:

  • fieldNames: string[] - The field names the children of this group should be matched against.
  • children: Query[] - The queries this query should combine based on their occurrence.
  • occurrence?: ("should"|"must"|"mustNot") - The occurrence requirement of the group (if e.g., included in a parent group). Defaults to "should".
  • boost?: number - The boost to multiply the score of the children of the group with when scoring the matching documents. Defaults to 1.

returns:

  • FieldGroupQuery

Function: matchAllQuery()

Creates a new MatchAllQuery matching all documents in the index.

returns:

  • MatchAllQuery

Filters

Filters can be used to narrow down the search result (or the full dataset) on any field with docValues=true. Various filters are provided for filtering on the different field types, and user-defined filters can be defined as well if needed.

To apply multiple filters, the filters must be combined into a single composite filter using a BooleanFilter, which determines how the results of each filter should be combined. Filters can be nested using BooleanFilters in as many levels as needed.

Filters are applied in the queryOptions object.

import { andFilter, greaterThanOrEqualFilter, termFilter } from "@chcaa/text-search-lite/filter";
// get all persons with age >= 35 who can swim
let all = personsIndex.search('', {
  filter: andFilter([
    greaterThanOrEqualFilter('age', 35),
    termFilter('hobbies', 'swimming')
  ])
});
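
Nesting filters, as mentioned above, could look like the following sketch using the andFilter and orFilter convenience functions:

import { andFilter, orFilter, greaterThanOrEqualFilter, termFilter } from "@chcaa/text-search-lite/filter";
// get all persons who are either female, or at least 35 years old and able to swim
let all = personsIndex.search('', {
  filter: orFilter([
    termFilter('gender', 'female'),
    andFilter([
      greaterThanOrEqualFilter('age', 35),
      termFilter('hobbies', 'swimming')
    ])
  ])
});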

Caching of Filters

Most predicate filters, such as TermFilter, RangeFilter, and PrefixFilter, are cacheable so their results can be reused, but they differ in whether caching is enabled by default or not, based on their presumed use-case (consult each filter's documentation for its cache settings).

A rule of thumb:

  • If the filter is expected to be reused, e.g., is a fixed range used for facets or another predefined range, turn caching on.
  • If the filter values vary a lot, e.g., is user defined with many possibilities, turn caching off.

If caching is turned off, add the filter after any other filters that have caching enabled, to minimize the number of per-document calculations of the filter.
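
As a sketch of the rule of thumb above, a fixed facet filter is placed first and a user-defined range filter with caching turned off is placed last (userMinAge and userMaxAge are illustrative):

import { andFilter, rangeFilter, termFilter } from "@chcaa/text-search-lite/filter";

let userMinAge = 30, userMaxAge = 50; // user-defined, varies between requests
let all = personsIndex.search('', {
  filter: andFilter([
    termFilter('hobbies', 'swimming'),                            // fixed facet value
    rangeFilter('age', userMinAge, userMaxAge, { cache: false })  // varies a lot, evaluated last
  ])
});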

Filter Types

The following filters are provided and described in detail in the next chapter.

Predicate filters: CustomFilter, DateRangeFilter, PrefixFilter, RangeFilter, RegexFilter, TermFilter.

Logical and special filters: BooleanFilter, ExistsFilter, IdsFilter.

Factory Functions

Factory functions for creating the different kinds of filters are exported from the @chcaa/text-search-lite/filter package along with the filter classes the factory functions produce. The factory functions are the suggested way of creating filters, while the classes can be used for type definitions.

Function: termFilter(fieldName, term, [options])

Creates a new TermFilter for filtering on keyword, tag, number, date, and boolean fields.

Caching of the filter is disabled by default, as the search index can be used directly as a cache and will be used instead. Only in cases where index=false and caching is required should caching be set to true.

Parameters:

  • fieldName: string - The name of the field to filter on.
  • term: string|number|boolean - The term to filter on.
  • options?: object - Config options.
    • cache?: boolean - Set to true if the filter should be cached and index=false for the field. Defaults to false.

returns:

  • TermFilter

Function: rangeFilter(fieldName, minValue, maxValue, [options])

Creates a new RangeFilter for filtering documents on the presence of a range of numbers or characters.

Only one of minValue or maxValue is required, making it possible to express greater-than and less-than filters.

Caching of the filter is enabled by default, but in cases where the ranges can vary a lot, e.g., with user-defined min and max values, the cache should be turned off. There is an initial overhead in calculating the filter when caching is enabled, as it is calculated on all documents (for reusability) instead of just the documents matching the query plus previous filters, which is the case when caching is turned off.

If caching is turned off, add the filter after any other filters that have caching enabled, to minimize the number of per-document calculations of the filter.

Parameters:

  • fieldName: string - The name of the field to filter on.
  • minValue: number|string - The minimum value to accept (default inclusive).
  • maxValue: number|string - The maximum value to accept (default exclusive).
  • options?: object - Config options.
    • minValueInclusive: boolean - Set to true if the minValue should be inclusive. Defaults to true.
    • maxValueInclusive: boolean - Set to true if the maxValue should be inclusive. Defaults to false.
    • cache?: boolean - Set to true if the filter should be cached. Defaults to true.

returns:

  • RangeFilter

Additionally, a set of convenience functions is supplied:

  • rangeFilterMaxValueInclusive(fieldName, minValue, maxValue) - Creates a new RangeFilter with maxValueInclusive=true.
  • greaterThanFilter(fieldName, minValue) - Creates a new RangeFilter with maxValue=undefined and minValueInclusive=false.
  • greaterThanOrEqualFilter(fieldName, minValue) - Creates a new RangeFilter with maxValue=undefined and minValueInclusive=true.
  • lessThanFilter(fieldName, maxValue) - Creates a new RangeFilter with minValue=undefined and maxValueInclusive=false.
  • lessThanOrEqualFilter(fieldName, maxValue) - Creates a new RangeFilter with minValue=undefined and maxValueInclusive=true.

Function: dateRangeFilter(fieldName, format, minDate, maxDate, [options])

Creates a new DateRangeFilter for filtering documents on the presence of a range of dates.

Only one of minDate or maxDate is required, making it possible to express greater-than and less-than filters.

Caching of the filter is enabled by default, but in cases where the ranges can vary a lot, e.g., with user-defined min and max dates, the cache should be turned off. There is an initial overhead in calculating the filter when caching is enabled, as it is calculated on all documents (for reusability) instead of just the documents matching the query plus previous filters, which is the case when caching is turned off.

If caching is turned off, add the filter after any other filters that have caching enabled, to minimize the number of per-document calculations of the filter.

Parameters:

  • fieldName: string - The name of the field to filter on.
  • format: string - The format of minDate and maxDate. E.g. yyyy-MM-dd.
  • minDate: number|string - The minimum date to accept (default inclusive).
  • maxDate: number|string - The maximum date to accept (default exclusive).
  • options?: object - Config options.
    • minDateInclusive: boolean - Set to true if the minDate should be inclusive. Defaults to true.
    • maxDateInclusive: boolean - Set to true if the maxDate should be inclusive. Defaults to false.
    • cache?: boolean - Set to true if the filter should be cached. Defaults to true.

returns:

  • DateRangeFilter

Additionally, a set of convenience functions is supplied:

  • dateRangeFilterMaxDateInclusive(fieldName, format, minDate, maxDate) - Creates a new DateRangeFilter with maxDateInclusive=true.
  • greaterThanDateFilter(fieldName, format, minDate) - Creates a new DateRangeFilter with maxDate=undefined and minDateInclusive=false.
  • greaterThanOrEqualDateFilter(fieldName, format, minDate) - Creates a new DateRangeFilter with maxDate=undefined and minDateInclusive=true.
  • lessThanDateFilter(fieldName, format, maxDate) - Creates a new DateRangeFilter with minDate=undefined and maxDateInclusive=false.
  • lessThanOrEqualDateFilter(fieldName, format, maxDate) - Creates a new DateRangeFilter with minDate=undefined and maxDateInclusive=true.

Function: prefixFilter(fieldName, prefix, [options])

Creates a new PrefixFilter for filtering documents on the presence of a given term prefix in the document.

Caching of the filter is disabled by default, as it is expected that the prefix will vary a lot, and there is an initial overhead in calculating the filter when caching is enabled, as it is calculated on all documents (for reusability) instead of just the documents matching the query plus previous filters, which is the case when caching is turned off.

If caching is turned off, add the filter after any other filters that have caching enabled, to minimize the number of per-document calculations of the filter.

Parameters:

  • fieldName: string - The name of the field to filter on.
  • prefix: string - The prefix to filter on.
  • options?: object - Config options.
    • cache?: boolean - Set to true if the filter should be cached. Defaults to false.

returns:

  • PrefixFilter

Function: regexFilter(fieldName, regex, [options])

Creates a new RegexFilter for filtering documents on the presence of a given regex pattern in the document.

Caching of the filter is enabled by default, but in cases where the regex can vary a lot, e.g., with a user-defined regex, the cache should be turned off. There is an initial overhead in calculating the filter when caching is enabled, as it is calculated on all documents (for reusability) instead of just the documents matching the query plus previous filters, which is the case when caching is turned off.

If caching is turned off, add the filter after any other filters that have caching enabled, to minimize the number of per-document calculations of the filter.

Parameters:

  • fieldName: string - The name of the field to filter on.
  • regex: RegExp - The regex to filter on.
  • options?: object - Config options.
    • cache?: boolean - Set to true if the filter should be cached. Defaults to true.

returns:

  • RegexFilter

Function: customFilter(fieldName, predicate)

Creates a new CustomFilter for filtering documents on the result of a predicate function.

This filter is not cacheable, as the predicate function cannot be guaranteed to produce the same result for the same input because the predicate function's algorithm can rely on changing variables, time, etc.

Parameters:

  • fieldName: string - The name of the field to filter on.
  • predicate: function(value):boolean - A predicate function which is passed each value from the field and should return true if the document with the value should be included, false otherwise.

returns:

  • CustomFilter

Function: existsFilter(fieldName)

Creates a new ExistsFilter which tests for existence of a value for the given field. The field exists if the value is not null, undefined or [].

Parameters:

  • fieldName: string - The name of the field to filter on.

returns:

  • ExistsFilter

Function: idsFilter(ids)

Creates a new IdsFilter for filtering documents on their id. The filter accepts an Iterable of ids.

Parameters:

  • ids: Iterable<number> - The ids of the documents to include.

returns:

  • IdsFilter

Function: booleanFilter(filters, booleanOperator)

Creates a new BooleanFilter for combining multiple Filter instances results with one of AND, OR, AND_NOT logic.

The AND_NOT filter subtracts the result of its filters from the parent filter's (or query's) results.

In most cases using the convenience functions andFilter, orFilter, andNotFilter is both easier and more expressive in terms of intent.

Parameters:

  • filters: Filter[] - The filters whose results should be combined with the passed-in operator logic.
  • booleanOperator: ("and"|"or"|"andNot") - The boolean operator logic to combine the filters with. A boolean operator enum is available at BooleanFilter.operator.

returns:

  • BooleanFilter

Aggregations

Aggregations can be used to collect aggregated statistics about the result of a query. This could, e.g., be:

  • the top 10 hobbies of documents
  • number of documents grouped by age ranges
  • number of documents grouped by birth-year decade
  • etc.

Multiple aggregations can be requested at the same time, and aggregations can be nested to create drill-down detail hierarchies.

To request one or more aggregations, include them in the aggregations part of the queryOptions object.

import { rangeAggregationWithIntegerAutoBuckets, termAggregation } from "@chcaa/text-search-lite/aggregation";
// get aggregations about all (empty string = match all) documents' gender and hobbies
let all = personsIndex.search('', {
  aggregations: [
    termAggregation('gender'),
    termAggregation('hobbies', 2),
    rangeAggregationWithIntegerAutoBuckets('age', 5, 0, 100),
  ]
});

The results of the requested aggregations are included as an array on the result object from the query. All aggregation results have the same set of base properties where only the bucket objects differ depending on the type of aggregation requested.

{
  results: [/*... */],
  aggregations: [
    {
      name: 'gender',
      fieldName: 'gender',
      type: "term",
      fieldType: "keyword",
      buckets: [
        { key: 'female', docCount: 2 },
        { key: 'male', docCount: 1 }
      ],
      totalBucketCount: 4, // the total number of possible buckets (unique terms) 
      missingDocCount: 0,
    },
    {
      name: 'hobbies',
      fieldName: 'hobbies',
      type: "term",
      fieldType: "tag",
      buckets: [
        { key: 'swimming', docCount: 2 },
        { key: 'cycling', docCount: 1 }
      ],
      missingDocCount: 1 // person with id=3 does not have any hobbies
    },
    {
      name: 'age',
      fieldName: 'age',
      type: "range",
      fieldType: "number",
      buckets: [
        { key: '0-20', from: 0, to: 20, docCount: 0 },
        { key: '20-40', from: 20, to: 40, docCount: 2 },
        { key: '40-60', from: 40, to: 60, docCount: 1 },
        { key: '60-80', from: 60, to: 80, docCount: 0 },
        { key: '80-100', from: 80, to: 100, docCount: 0 }
      ],
      missingDocCount: 0
    }
  ]
}

Peculiarities of aggregation buckets
As documents with array fields can occur more than once in the aggregated statistics, the sum of the counted document values may exceed the total number of documents in the query result. This is expected.

Factory Functions

Factory functions for creating the different kinds of aggregations are exported from the @chcaa/text-search-lite/aggregation package along with the aggregation classes the factory functions produce. The factory functions are the suggested way of creating aggregation requests, while the classes can be used for type definitions.

Function: termAggregation(fieldName, [maxSize], [options])

Creates a new TermAggregation for collecting statistics about keyword, tag, number, date, and boolean fields. The occurrence of each distinct value will be counted once per document and returned in descending order with the value with the most documents at the top.

Note
When a filter is set in the options-object, caching will be disabled for the aggregation results.

Parameters:

  • fieldName: string - The name of the field to aggregate on.
  • maxSize?: number - The maximum number of buckets. Defaults to 10.
  • options?: object - Config options. (See also the general options in Aggregation Options).
    • filter: PredicateFilter - A predicate filter for filtering the terms to include in the aggregation.

returns:

  • TermAggregation

Bucket results

Buckets are sorted by docCount:DESC, term:ASC.

  {
  // name, fieldName, etc...
  buckets: [
    { key: 'female', docCount: 2 },
    { key: 'male', docCount: 1 }
  ],
  totalBucketCount: 4, // the total number of possible buckets (unique terms) 
  missingDocCount: 0
}

Filter example

Filter bucket keys (terms) using a PrefixFilter so we only get buckets where the key starts with a "c".

import { termAggregation } from "@chcaa/text-search-lite/aggregation";
import { prefixFilter } from "@chcaa/text-search-lite/filter";

let all = personsIndex.search('', {
  aggregations: [
    termAggregation('hobbies', 10, {
      filter: prefixFilter('hobbies', 'c')
    })
  ]
});

Function: rangeAggregation(fieldName, ranges, [options])

Creates a new RangeAggregation for collecting statistics about number, keyword and tag fields.

Parameters:

  • fieldName: string - The name of the field to aggregate on.
  • ranges: object[] - The ranges to create buckets for.
    • ranges[].from: number|string - The lower limit of the bucket, inclusive. Optional for the first range if no lower limit is required.
    • ranges[].to: number|string - The upper limit of the bucket, exclusive. Optional for the last range if no upper limit is required.
  • options?: object - Config options.

returns:

  • RangeAggregation

Bucket results

Buckets are sorted by the order they were requested.

  {
  // name, fieldName, etc...
  buckets: [
    { key: '0-20', from: 0, to: 20, docCount: 0 },
    { key: '20-40', from: 20, to: 40, docCount: 2 },
    { key: '40-60', from: 40, to: 60, docCount: 1 },
  ],
  missingDocCount: 0
}
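
A sketch using explicit, uneven ranges where the first range has no lower limit and the last no upper limit (using the persons index from Getting Started):

import { rangeAggregation } from "@chcaa/text-search-lite/aggregation";

let all = personsIndex.search('', {
  aggregations: [
    rangeAggregation('age', [
      { to: 18 },           // everything below 18
      { from: 18, to: 65 },
      { from: 65 }          // 65 and above
    ])
  ]
});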

Additionally, a set of convenience functions is supplied:

  • rangeAggregationWithIntegerAutoBuckets(fieldName, bucketCount, min, max, [options]) - Creates a new RangeAggregation where the buckets are auto-generated based on the input parameters.
  • rangeAggregationWithIntegerAutoBucketsOpenEnded(fieldName, bucketCount, min, max, [options]) - Creates a new open-ended RangeAggregation where the buckets are auto-generated based on the input parameters. The first bucket will only have to defined and the last bucket will only have from defined; the bucket ranges are thus open-ended.
  • rangeAggregationWithNumberAutoBuckets(fieldName, bucketCount, min, max, [options]) - Creates a new RangeAggregation where the buckets are auto-generated based on the input parameters.
  • rangeAggregationWithNumberAutoBucketsOpenEnded(fieldName, bucketCount, min, max, [options]) - Creates a new open-ended RangeAggregation where the buckets are auto-generated based on the input parameters. The first bucket will only have to defined and the last bucket will only have from defined; the bucket ranges are thus open-ended.

Function: dateRangeAggregation(fieldName, format, ranges, [options])

Creates a new DateRangeAggregation for collecting statistics about date fields. Date range aggregations work in the same way as range aggregations except that the bucket ranges can be expressed in a string date format.

Parameters:

  • fieldName: string - The name of the field to aggregate on.
  • format: string - The date format of the ranges. One of yyyy, yyyy-MM-dd or yyyy-MM-dd'T'HH-mm-ssZ.
  • ranges: object[] - The ranges to create buckets for.
    • ranges[].from: number|string - The lower limit of the bucket, inclusive. Optional for the first range if no lower limit is required.
    • ranges[].to: number|string - The upper limit of the bucket, exclusive. Optional for the last range if no upper limit is required.
  • options?: object - Config options.

returns:

  • DateRangeAggregation

Bucket results

Buckets are sorted by the order they were requested.

  {
  // name, fieldName, etc...
  buckets: [
    { key: '1940-1950', from: '1940', to: '1950', fromMillis: -946771200000, toMillis: -631152000000, docCount: 0 },
    { key: '1990-2000', from: '1990', to: '2000', fromMillis: 631152000000, toMillis: 946684800000, docCount: 2 },
  ],
  missingDocCount: 0
}

Aggregation Options

All aggregations can additionally be configured to have a user-defined name and to include nested aggregations using the following options object structure.

  • name?: string - The name of the aggregation, e.g., to distinguish two aggregations on the same field. If undefined, the field name will be used.
  • aggregations?: Aggregation[] - Child aggregations to collect for each bucket of the aggregation.

Child aggregations can be requested as follows:

import { rangeAggregationWithIntegerAutoBuckets, termAggregation } from "@chcaa/text-search-lite/aggregation";
// get gender aggregations with the top 2 hobbies for each gender as a child aggregation
let all = personsIndex.search('', {
  aggregations: [
    termAggregation('gender', {
      aggregations: [
        termAggregation('hobbies', 2) // Top 2 hobbies for each gender
      ]
    })
  ]
});

The result of the child aggregation will be attached to each parent bucket.

  {
  // name, fieldName, etc...
  buckets: [
    {
      key: 'female', docCount: 2,
      aggregations: [
        {
          // name, fieldName, etc...
          buckets: [
            { key: 'swimming', docCount: 1 },
            { key: 'cycling', docCount: 1 }
          ]
        }
      ]
    },
    {
      key: 'male', docCount: 1,
      aggregations: [
        {
          // name, fieldName, etc...
          buckets: [
            { key: 'swimming', docCount: 1 }
          ]
        }
      ]
    }
  ],
  missingDocCount: 0
}

SearchIndex Configuration

A SearchIndex can be further configured using the options argument where schema configuration can be customized as described in the Document Schema chapter as well as configuration of different default query and cache settings.

The options-object should be passed as the second argument to the SearchIndex constructor.

import { SearchIndex } from "@chcaa/text-search-lite";

let searchIndex = new SearchIndex([/* fields */], {
  schema: { /* settings */ },
  source: { /* settings */ },
  query: { /* settings */ },
  filter: { /* settings */ },
  aggregation: { /* settings */ },
  sorting: { /* settings */ }
});

Search index options

// TODO document source: { store, strategy }

  • schema?: object - General schema configuration options. See the Document Schema chapter.
  • source?: object - Document source object storage configuration.
    • store?: boolean - true if the document source object should be stored. This improves updates and deletes and makes highlighting possible without supplying an idToSourceResolver. Defaults to true.
    • strategy?: ("memory"|"db") - The storage strategy to use for storing the source object. In both cases the storage is temporary and only spans the current program execution. Defaults to "memory". To use the "db" strategy, better-sqlite3 must be included as a dependency in the project's package.json.
  • query?: object - Default configuration of query options.
    • options?: object - Custom overrides of the default queryOptions used in search(). Each possible query option can be configured to have a default fallback if not provided in the runtime queryOptions passed to search(). The overrides will be merged with the system defined default queryOptions.
  • filter?: object - General filter configuration options.
    • cache?: object - Filter cache configuration.
      • maxSize?: number - The maximum size of the filter cache. Defaults to 100.
      • minDocCount?: number - The minimum number of document inputs to a filter before the filter is cached. Defaults to 100.
  • aggregation?: object - General aggregation configuration options.
    • cache?: object - Aggregation cache configuration.
      • maxSize?: number - The maximum size of the aggregation cache. Defaults to 100.
      • minDocCount?: number - The minimum number of document inputs to an aggregation before the aggregation is cached. Defaults to 100.
  • sorting?: object - General sorting configuration options.
    • cache?: object - Sorting cache configuration.
      • maxSize?: number - The maximum size of the sorting cache. Defaults to 100.
      • minDocCount?: number - The minimum number of document inputs to be sorted before the sorted result is cached. Defaults to 100.

Filters, aggregations, and sorting of search results each have their own cache which can be configured independently.

The cache works like a queue where the oldest elements are removed first when the limit of the cache is exceeded. The cached elements are stored using a key representing the content of the entry, ensuring that entries with the same content are only stored once.

To avoid unnecessary recalculations of "hot" cached entries and still only allow the same entry once in the cache, any existing entries are moved to the back of the queue each time they are requested.
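
As a concrete illustration, the sketch below fills in several of the options described above; the specific numbers are just examples, not recommended values.

import { SearchIndex } from "@chcaa/text-search-lite";

let searchIndex = new SearchIndex([/* fields */], {
  source: {
    store: true,       // keep source objects so highlighting works without an idToSourceResolver
    strategy: 'memory' // or 'db' (requires better-sqlite3 as a project dependency)
  },
  filter: {
    cache: { maxSize: 200, minDocCount: 50 }
  },
  aggregation: {
    cache: { maxSize: 200, minDocCount: 50 }
  },
  sorting: {
    cache: { maxSize: 100, minDocCount: 100 }
  }
});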

Bm25f Scoring Algorithm

The scoring algorithm builds on the bm25f algorithm as described in foundations of bm25 review and Okapi bm25. The algorithm groups all the fields of a document (included in the search) with the same analyzer into one virtual field before scoring the term against the virtual field.

This approach typically gives better results than scoring each field individually and combining the results afterwards, as the importance of a term is considered across all fields instead of each field in isolation.

The boost of a field is integrated into the algorithm by using the boost as a multiplier for the term frequency in the given field, thereby making terms in that field boost-factor times more important.

Formula:

  • streams/fields: s = 1, ..., S
  • stream length: sl_s
  • stream weight: v_s
  • stream term frequency: tf_s,i
  • avg. stream length across all docs: avsl_s
  • term: i
  • total docs with stream: n
  • docs with i in stream: df_i
  • stream length relevance: b
  • term frequency relevance: k1

[bm25f formula image]
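
As a point of reference, the standard bm25f formulation from the literature, written with the symbols above, looks roughly as follows; the exact normalization used by text-search-lite may differ slightly.

\tilde{tf}_i = \sum_{s=1}^{S} \frac{v_s \cdot tf_{s,i}}{1 + b \left( \frac{sl_s}{avsl_s} - 1 \right)}

score_i = \frac{\tilde{tf}_i}{k_1 + \tilde{tf}_i} \cdot \log \frac{n - df_i + 0.5}{df_i + 0.5}

Here \tilde{tf}_i is the combined term frequency across all streams (the virtual field), and the document score is the sum of score_i over all query terms.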

Tuning b and k1 parameters

b determines the impact of the field's length when calculating the score; it defaults to 0.75 and must be in the range [0–1]. Lower values mean smaller length impact and vice versa. b can be configured on a per-field basis, and for fields with only short text segments a lower value should be considered, so that a change in length of only a few terms does not affect the score too much. A title field could, for example, have a b of 0.25.

For fields like person.name, even a b value of 0.0 should be considered, as a search for Andersen should probably yield the same score for both Gillian Andersen and Hans Christian Andersen and not include the length of the name in the score at all. Either the person has the name searched for or not; the length of the full name is not relevant.

k1 determines the impact of the term frequency in matching fields and is in bm25f applied once per term to the score for all fields with the same analyzer (see the formula above). k1 has a default of 1.2 but can be changed for the whole document index or for each analyzer individually.

It is also possible to change how the term frequency of a document affects the score by turning docStats.termFrequencies off for a field. The term count will then always be 1 if the term exists in the field, no matter the actual term frequency, and 0 if the term does not exist in the field.

docStats.termFrequencies is turned off by default for all fields other than text fields, as those fields are not tokenized, so counting term frequencies will in most cases make no difference and just consume memory.
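
The exact schema keys for these settings are defined in the Document Schema chapter; the sketch below is only an assumption about where b and docStats.termFrequencies are placed on a field definition and should be checked against that chapter.

import { SearchIndex } from "@chcaa/text-search-lite";

let index = new SearchIndex([
  // assumption: per-field scoring options are set directly on the field definition
  { name: 'title', type: SearchIndex.fieldType.TEXT, b: 0.25 },
  { name: 'name', type: SearchIndex.fieldType.TEXT, b: 0.0, docStats: { termFrequencies: false } }
]);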

Method Summary SearchIndex

The SearchIndex exposes the following properties and methods (a short usage sketch follows the list):

  • docCount - The total number of documents in the index.
  • indexedFields - Name and type of all indexed fields.
  • sortingFields - Name and type of all fields that can be used for sorting.
  • filterFields - Name and type of all fields that can be used for filtering.
  • hasField(fieldName) - Tests if the field exists.
  • getFieldType(fieldName) - Returns the type of the field.
  • add(document) - Adds a document to the index.
  • addAll(documents) - Adds multiple documents to the index (optimized for performance).
  • update(document) - Updates a document in the index.
  • deleteById(id) - Removes a document from the index.
  • delete(document) - Removes a document from the index.
  • has(document) - Tests if the document is included in the index.
  • hasId(id) - Tests if the document id is included in the index.
  • getSource(id) - Returns the source document with the given id.
  • getSources(ids) - Returns the source documents with the given ids.
  • search(query, queryOptions) - Searches the index.
  • parseQueryStringToQueryObjects(queryString, [options]) - Parses the query-string to query objects.
  • validateAggregation(aggregation) - Validates the aggregation and throws a ValidationError if it is not valid.
  • validateFilter(filter) - Validates the filter and throws a ValidationError if it is not valid.
  • validateQueryString(queryString, [parseOptions]) - Validates the query string.
  • clearCache() - Clears any cached filters, aggregations, and sorted results. This is done automatically on any change to the index.
  • getAllFieldTerms(fieldName) - Returns all terms indexed for the field.
  • analyze(value, fieldName, [indexName], [includeOffsets]) - Analyzes the value with the Analyzer configured for the field and field index.
  • queryStringParseOptionsWith(options) <static> - Creates a query-string parse options object where all options not explicitly defined are set to false. This method is typically not required to be called manually.
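
A short sketch combining a few of the methods above, reusing the personsIndex from the earlier examples (the document values are illustrative):

// add, look up, and remove a document
personsIndex.add({ id: 4, name: 'Mark', gender: 'male', age: 41, hobbies: ['Running'] });
personsIndex.hasId(4);      // true
personsIndex.getSource(4);  // the stored source object (requires source.store to be enabled)
personsIndex.deleteById(4); // remove it again
personsIndex.docCount;      // back to the original count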
