handlens

1.0.0-b8 • Public • Published

🔎 handlens

Search like you expect

handlens is a document full-text search engine with zero dependencies.

Installation

npm install handlens

Setup

import handlens from "handlens";

Pass a setup function to handlens().

The function is called with the new index as its context (the value of this) and as its first parameter.

Note that if you use an arrow function expression, you must use the first parameter, because an arrow function does not rebind its this value.
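
The difference between the two calling styles can be seen without handlens at all. The sketch below uses a hypothetical stand-in, createIndex, that invokes its callback the way the paragraphs above describe (the index as both this and the first argument). It is an illustration of the calling convention only, not the library's implementation.

```javascript
// Hypothetical stand-in for handlens(), for illustration only.
// It calls the setup function with the new index as `this` AND as the first argument.
function createIndex(setup) {
    const index = { fields: [], documents: [] };
    if (typeof setup === "function") {
        setup.call(index, index);
    }
    return index;
}

// A regular function expression can use `this`.
const viaThis = createIndex(function () {
    this.fields = ["title"];
});

// An arrow function ignores the `this` supplied by .call(), so use the parameter.
const viaParameter = createIndex((index) => {
    index.fields = ["title"];
});

console.log(viaThis.fields);      // [ 'title' ]
console.log(viaParameter.fields); // [ 'title' ]
```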

Using the index provided to your function, set the fields to index, the documents to search, and, optionally, a document reference (the property that uniquely identifies each document).

var mySearchableIndex = handlens( ( index ) => {
    index.fields = [
        "body",
        "title"
    ];
    index.documents = [
        {
            "bookId": 1,
            "title": "A Tale of Two Cities",
            "source": "https://en.wikiquote.org/wiki/A_Tale_of_Two_Cities",
            "body": "It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, it was the epoch of belief, it was the epoch of incredulity, it was the season of Light, it was the season of Darkness, it was the spring of hope, it was the winter of despair, we had everything before us, we had nothing before us, we were all going direct to Heaven, we were all going direct the other way – in short, the period was so far like the present period, that some of its noisiest authorities insisted on its being received, for good or for evil, in the superlative degree of comparison only."
        },
        {
            "bookId": 2,
            "title": "Of Mice And Men",
            "source": "https://2paragraphs.com/2012/08/of-mice-and-men/",
            "body": "A few miles south of Soledad, the Salinas River drops in close to the hillside bank and runs deep and green."
        }
    ];
} );

All of these values can also be set after the index is created, but you must then tell the index to rebuild.

var mySearchableIndex = handlens();

mySearchableIndex.fields = [ "body", "title" ];
mySearchableIndex.documents = [
    {
        "bookId": 1,
        "title": "A Tale of Two Cities",
        "source": "https://en.wikiquote.org/wiki/A_Tale_of_Two_Cities",
        "body": "It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, it was the epoch of belief, it was the epoch of incredulity, it was the season of Light, it was the season of Darkness, it was the spring of hope, it was the winter of despair, we had everything before us, we had nothing before us, we were all going direct to Heaven, we were all going direct the other way – in short, the period was so far like the present period, that some of its noisiest authorities insisted on its being received, for good or for evil, in the superlative degree of comparison only."
    },
    {
        "bookId": 2,
        "title": "Of Mice And Men",
        "source": "https://2paragraphs.com/2012/08/of-mice-and-men/",
        "body": "A few miles south of Soledad, the Salinas River drops in close to the hillside bank and runs deep and green."
    }
];

mySearchableIndex.rebuild();

Searching

Once you have created an index, it can be searched at any time.

var mySearchableIndex = handlens( ( index ) => {
    index.fields = [
        "body",
        "title"
    ];
    index.documents = [
        {
            "bookId": 1,
            "title": "A Tale of Two Cities",
            "source": "https://en.wikiquote.org/wiki/A_Tale_of_Two_Cities",
            "body": "It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, it was the epoch of belief, it was the epoch of incredulity, it was the season of Light, it was the season of Darkness, it was the spring of hope, it was the winter of despair, we had everything before us, we had nothing before us, we were all going direct to Heaven, we were all going direct the other way – in short, the period was so far like the present period, that some of its noisiest authorities insisted on its being received, for good or for evil, in the superlative degree of comparison only."
        },
        {
            "bookId": 2,
            "title": "Of Mice And Men",
            "source": "https://2paragraphs.com/2012/08/of-mice-and-men/",
            "body": "A few miles south of Soledad, the Salinas River drops in close to the hillside bank and runs deep and green."
        }
    ];
} );

mySearchableIndex.search( "mice cities" );
// returns: [ { "ref": "1" }, { "ref": "2" } ]

You can search only specific fields.

mySearchableIndex.search( "body:mice title:cities" );
// returns: [ { "ref": "1" } ]

You can combine terms with Boolean OR (the default) or Boolean AND.

// implicit Boolean OR
mySearchableIndex.search( "body:mice title:cities" );
// returns: [ { "ref": "1" } ]

// explicit Boolean OR
mySearchableIndex.search( "body:mice OR title:cities" );
// returns: [ { "ref": "1" } ]

// Boolean AND
mySearchableIndex.search( "title:mice AND title:cities" );
// returns: [] <-- No documents contain "mice" AND "cities" in the title

.search is convenient for user input, but it has to parse the query string on every call, which is comparatively expensive.
If you are searching programmatically, use .query instead. .search runs the string input through a query builder and then immediately calls .query with the resulting queries.

A query is an array of objects in the format:

{
    "bool": boolean,
    "fields": {
        "*": tokens,
        fieldName: tokens
    }
}

boolean is one of [ "OR", "AND" ]. If "OR", all tokens will be compared individually. A document that matches one token but does not match another given token will still be considered a match. If "AND", every given token must be matched in a single document for it to be considered matching.

tokens must be an array of token strings. Note that by default, text is tokenized by splitting on whitespace and lowercasing each token (see Settings), so tokens supplied in any other form will not match.

fieldName must be a registered field. That is, if you created the index with index.fields = [ "alpha" ]; the only allowable value for fieldName is "alpha".

The "any field" ("*") entry is required, but the value can be an empty array.
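
To make the "OR" and "AND" semantics above concrete, here is a minimal sketch that evaluates one query object against one pre-tokenized document. The helper matchesQuery and the docTokens shape are hypothetical names introduced for illustration; this is not how handlens evaluates queries internally, and the handling of a query with no tokens is an assumption.

```javascript
// Hypothetical helper: does one pre-tokenized document satisfy one query object?
// `docTokens` maps each field name to an array of that document's tokens.
function matchesQuery(query, docTokens) {
    const checks = [];
    for (const [field, tokens] of Object.entries(query.fields)) {
        for (const token of tokens) {
            const found = field === "*"
                ? Object.values(docTokens).some((list) => list.includes(token))
                : (docTokens[field] || []).includes(token);
            checks.push(found);
        }
    }
    if (checks.length === 0) {
        return false; // assumption: a query with no tokens matches nothing
    }
    // "AND": every token must match; "OR": one matching token is enough.
    return query.bool === "AND" ? checks.every(Boolean) : checks.some(Boolean);
}

// Abbreviated, already-tokenized sample document.
const taleOfTwoCities = {
    title: ["a", "tale", "of", "two", "cities"],
    body: ["it", "was", "the", "best", "of", "times"]
};

const orQuery = { bool: "OR", fields: { "*": [], body: ["mice"], title: ["cities"] } };
const andQuery = { bool: "AND", fields: { "*": [], title: ["mice", "cities"] } };

console.log(matchesQuery(orQuery, taleOfTwoCities));  // true
console.log(matchesQuery(andQuery, taleOfTwoCities)); // false
```

Per the description above, an array of objects shaped like orQuery and andQuery is what .query receives.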

Settings

When creating an index (or once one is created), you can change various settings, some of which significantly alter the way handlens works.

var myIndex = handlens( ( index ) => {
    index.settings.documents.retainAfterIndex = false;
} );

myIndex.settings.documents.retainAfterIndex = true;

All Settings:

| Setting | Default | What It Does |
| --- | --- | --- |
| settings.documents.retainAfterIndex | true | If this value is false, rebuilding the index will delete all of the source documents. Adding a document with this set to false will not store the new document. This is nice if you have an enormous amount of data and you can reference it elsewhere so that the index itself doesn't store a copy of everything. |
| settings.tokenize.separator | /\s+/ | Determines how the tokenizer splits strings. By default it grabs as much contiguous whitespace as possible and splits the tokens on that. This value is passed directly to String.prototype.split, so it can either be a RegExp or a String. |
| settings.tokenize.lowercase | true | If false, the tokenizer will not lowercase every token it finds. Convenient if you want case-sensitive searching, but keep in mind that hello and Hello are not the same token if lowercasing is turned off. |
| settings.stopwords.lang | "en" | The language to use when stripping stopwords from tokens. |
| settings.stopwords.list | { "en": [...], ... } | Very long lists of stopwords for a bunch of languages. Arrays of strings keyed by ISO 639-1 two-letter language code. |
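
As a rough illustration of how the tokenize and stopwords settings interact, the sketch below implements a small tokenizer driven by a settings object of the same shape. The tokenize function is hypothetical, the stopword list is a tiny made-up sample, and the order of the steps is an assumption; handlens's own tokenizer may differ.

```javascript
// Hypothetical tokenizer mirroring the tokenize and stopwords settings above.
// Not handlens's implementation; the stopword list is a tiny made-up sample.
const settings = {
    tokenize: { separator: /\s+/, lowercase: true },
    stopwords: { lang: "en", list: { en: ["the", "of", "it", "was"] } }
};

function tokenize(text, settings) {
    const stopwords = new Set(settings.stopwords.list[settings.stopwords.lang] || []);
    return text
        .split(settings.tokenize.separator)   // RegExp or String, as with String.prototype.split
        .filter((token) => token.length > 0)  // drop empties from leading/trailing separators
        .map((token) => (settings.tokenize.lowercase ? token.toLowerCase() : token))
        .filter((token) => !stopwords.has(token));
}

console.log(tokenize("It was the  best of times", settings));
// [ 'best', 'times' ]

// A string separator with lowercasing turned off keeps "Hello" and "hello" distinct.
const caseSensitive = { ...settings, tokenize: { separator: ",", lowercase: false } };
console.log(tokenize("Hello,hello", caseSensitive));
// [ 'Hello', 'hello' ]
```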

Advanced

Rather than constantly rebuilding the index with a modified set of documents, even if the set differs by only one or two documents, you can use .addDocument.

var idx = handlens();

idx.documents = [ ..., ... ];
idx.rebuild();

idx.addDocument( { ... } );

This will index just that document without re-indexing every other document.
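
The saving can be illustrated with a miniature inverted index (a map from token to the set of document references containing it). TinyIndex and its members are hypothetical and exist only for this sketch, and the third document is made-up sample data; the sketch shows incremental indexing in general, not handlens's internals.

```javascript
// Hypothetical miniature inverted index: token -> Set of document refs.
class TinyIndex {
    constructor(field, refProperty) {
        this.field = field;
        this.refProperty = refProperty;
        this.postings = new Map();
        this.documentsIndexed = 0; // counts indexing work, to make the saving visible
    }

    // Index a single document without touching existing postings.
    addDocument(doc) {
        const ref = String(doc[this.refProperty]);
        for (const token of String(doc[this.field]).toLowerCase().split(/\s+/)) {
            if (!token) continue;
            if (!this.postings.has(token)) this.postings.set(token, new Set());
            this.postings.get(token).add(ref);
        }
        this.documentsIndexed += 1;
    }

    // Throw everything away and index every document again.
    rebuild(documents) {
        this.postings.clear();
        documents.forEach((doc) => this.addDocument(doc));
    }

    lookup(token) {
        return [...(this.postings.get(token.toLowerCase()) || [])];
    }
}

const tiny = new TinyIndex("title", "bookId");
tiny.rebuild([
    { bookId: 1, title: "A Tale of Two Cities" },
    { bookId: 2, title: "Of Mice And Men" }
]);
tiny.addDocument({ bookId: 3, title: "Cities of the Plain" });

console.log(tiny.lookup("cities"));  // [ '1', '3' ]
console.log(tiny.documentsIndexed);  // 3
```

Had the third document been added by rebuilding with all three documents, the counter would read 5 instead of 3, because the first two documents would have been indexed twice.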

The same approach works for fields, using .addField.

var idx = handlens();

idx.fields = [ "alpha" ];
idx.rebuild();

idx.addField( "beta" );

Note, however, that adding a field changes the entire root structure of the index, so .addField issues a .rebuild after adding the field.

It would be prudent to determine the list of fields before initializing the index. Likewise, if you need to add a number of fields, it would be best to simply push them onto the list and then issue a single .rebuild at the end. This method is provided for convenience only and is not the most efficient way to modify the list of fields.
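
The cost difference between the two approaches can be sketched with a stand-in object that only counts rebuilds. createCountingIndex is hypothetical and does no real indexing; it mimics only the behavior described above, where adding a field through .addField triggers a rebuild.

```javascript
// Hypothetical stand-in that only counts rebuilds; it does no real indexing.
function createCountingIndex() {
    return {
        fields: [],
        rebuildCount: 0,
        rebuild() {
            this.rebuildCount += 1;
        },
        addField(name) {
            this.fields.push(name);
            this.rebuild(); // one rebuild per added field
        }
    };
}

// Adding fields one at a time rebuilds after every field.
const oneByOne = createCountingIndex();
["alpha", "beta", "gamma"].forEach((name) => oneByOne.addField(name));

// Pushing every field first and rebuilding once does the work a single time.
const batched = createCountingIndex();
batched.fields.push("alpha", "beta", "gamma");
batched.rebuild();

console.log(oneByOne.rebuildCount); // 3
console.log(batched.rebuildCount);  // 1
```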

Planned Features

  • Parenthetical groups, distribution, and expansion
    • title:(mice cities) AND body:(winter river)
      • This query should search for:
        • mice in the title AND winter in the body
        • OR mice in the title AND river in the body
        • OR cities in the title AND winter in the body
        • OR cities in the title AND river in the body
    • This behavior is currently achievable by being extremely verbose
      • title:mice AND body:winter title:mice AND body:river title:cities AND body:winter title:cities AND body:river
  • Affinities
    • Matches should be ranked by how high the affinity is between the document and the query.
      • A document that is nothing but the word "cats" repeated hundreds of times should have a much higher affinity for a search like cats than an article that is regular English, even if it is about the topic of cats (and may therefore contain the word cats a few times).
  • Field Boosting
    • It should be possible to boost a field at query time or at index time to increase the affinity of matches found in that field
  • Token Boosting
    • It should be possible to boost a single token at query time so that matches for that token have an increased affinity
  • Fuzzy matching
    • hallo should match hello with a small affinity hit
    • More info
  • Stemming
    • automobile and automotive should be stemmed to automo and searches for automotive should also match automobile with a small affinity hit (and vice versa).
    • More info
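
Until parenthetical groups are supported, the verbose form from the first planned feature above can be generated programmatically. The sketch below builds it as a Cartesian product of the per-field token lists. expandGroups and its input shape are hypothetical and not part of handlens.

```javascript
// Hypothetical helper, not part of handlens: expand grouped field terms into
// the verbose query string that works today.
function expandGroups(groups) {
    // Cartesian product of the per-field token lists.
    const combinations = groups.reduce(
        (partial, group) =>
            partial.flatMap((terms) =>
                group.tokens.map((token) => [...terms, `${group.field}:${token}`])
            ),
        [[]]
    );
    // AND within a combination; a space (implicit OR) between combinations.
    return combinations.map((terms) => terms.join(" AND ")).join(" ");
}

// Equivalent of: title:(mice cities) AND body:(winter river)
const verbose = expandGroups([
    { field: "title", tokens: ["mice", "cities"] },
    { field: "body", tokens: ["winter", "river"] }
]);

console.log(verbose);
// title:mice AND body:winter title:mice AND body:river title:cities AND body:winter title:cities AND body:river
```

The resulting string can be passed to .search as-is. The number of combinations grows multiplicatively with each group, so this is only practical for small groups.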

License

MIT
