
    algolia-webcrawler

    Algolia Webcrawler

    Simple node worker that crawls sitemaps in order to keep an Algolia index up-to-date.

    It uses simple CSS selectors in order to find the actual text content to index.

    This app uses Algolia's library.

    TL;DR

    1. Usage
    2. Prerequisites
    3. Installation
    4. Running
    5. Configuration file
    6. Configuration options
    7. Stored Object
    8. Indexing
    9. License

    Usage

    This script should be run via crontab in order to crawl the entire website at regular intervals.

    Prerequisites

    1. At least one valid sitemap.xml URL that lists all the URLs you want indexed.
    2. The sitemap(s) must contain at least the <loc> node, i.e. urlset/url/loc.
    3. An empty Algolia index.
    4. An Algolia credential that can create objects and set settings on the index, i.e. search, addObject, settings, browse, deleteObject, editSettings, deleteIndex.
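
    To illustrate the second requirement, the sketch below pulls every urlset/url/loc value out of a sitemap string. It is only an illustration: the `extractLocs` name and the regular expression are assumptions made for this example, not the crawler's actual parser, and XML entities such as `&amp;` are not decoded.

```javascript
// Hypothetical helper: collects the text of every <loc> node in a sitemap.
// A regex keeps this sketch dependency-free; a real crawler would more likely use an XML parser.
function extractLocs(sitemapXml) {
    const locs = [];
    const pattern = /<loc>\s*([^<]+?)\s*<\/loc>/g;
    let match;
    while ((match = pattern.exec(sitemapXml)) !== null) {
        locs.push(match[1]);
    }
    return locs;
}

const sample = [
    '<?xml version="1.0" encoding="UTF-8"?>',
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">',
    '  <url><loc>https://example.org/</loc></url>',
    '  <url><loc>https://example.org/about</loc></url>',
    '</urlset>'
].join('\n');

console.log(extractLocs(sample));
```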

    Installation

    1. Get the latest version
      • npm: npm i algolia-webcrawler -g
      • git
        • ssh+git: git clone git@github.com:DeuxHuitHuit/algolia-webcrawler.git
        • https: git clone https://github.com/DeuxHuitHuit/algolia-webcrawler.git
      • https: download the latest tarball
    2. Create a config.json file

    Running

    npm

    algolia-webcrawler --config config.json
    

    other

    cd to the root of the project and run node app.

    Configuration file

    Configuration is done via the config.json file.

    You can choose a config.json file stored elsewhere using the --config flag.

    node app --config my-config.json

    Configuration options

    At the bare minimum, edit config.json to set values for the following options: 'app', 'cred', 'index.name' and at least one 'sitemap' object. If you have multiple sitemaps, please list them all: sub-sitemaps will not be crawled.

    All options are required. No defaults are provided.
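
    As a rough sketch, the script below prints a config.json skeleton built from the option names documented in this section. Every value is a placeholder, the selector strings are purely illustrative, and the shapes chosen for `types`, `defaults`, `plugins` and `blacklist` are assumptions inferred from the descriptions; check them against the config.json shipped with the project.

```javascript
// Placeholder values only: replace them with your own settings.
const config = {
    app: 'My website crawler',
    cred: {
        appid: 'YOUR_APP_ID',
        apikey: 'YOUR_API_KEY'
    },
    delayBetweenRequests: 500, // milliseconds
    oldentries: 86400,         // seconds; 0 keeps old entries
    maxRecordSize: 10000,      // bytes
    index: {
        name: 'my-index',
        settings: {
            attributesToIndex: ['title', 'description', 'text'],
            attributesForFaceting: ['lang']
        }
    },
    sitemaps: [
        { url: 'https://example.org/sitemap.xml', lang: 'en' }
    ],
    http: { auth: '' },        // empty string when no auth is needed
    selectors: {
        title: 'title',
        description: 'meta[name="description"]',
        image: 'meta[property="og:image"]',
        text: 'h1, h2, p, li'
    },
    exclusions: { text: 'nav, footer' },
    formatters: { title: ' | My website' },
    types: {},     // assumed shape: { key: 'integer' | 'float' | 'json' }
    defaults: {},  // assumed shape: { key: defaultValue }
    plugins: [],   // assumed shape: list of JavaScript file paths
    blacklist: ['/404']
};

console.log(JSON.stringify(config, null, 4));
```

    Saved under a hypothetical name such as make-config.js, it can be run as node make-config.js > config.json.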

    app: String

    The name of your app.

    cred: Object

    Algolia credentials object. See 'cred.appid' and 'cred.apikey'.

    cred.appid: String

    Your Algolia App ID.

    cred.apikey: String

    Your generated Algolia API key.

    delayBetweenRequests: Integer

    Delay, in milliseconds, between each request made to the website.
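
    One way to picture this option is a sequential loop that pauses between requests. The sketch below is an assumption about the general approach, not the crawler's code; `sleep`, `crawlSequentially` and the `visit` callback are hypothetical names.

```javascript
// Resolves after the given number of milliseconds.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Visits each URL in order, waiting `delayBetweenRequests` ms between two visits.
async function crawlSequentially(urls, delayBetweenRequests, visit) {
    const results = [];
    for (let i = 0; i < urls.length; i++) {
        if (i > 0) {
            await sleep(delayBetweenRequests);
        }
        results.push(await visit(urls[i]));
    }
    return results;
}

// Usage with a stand-in `visit` that only records when each call happened.
const started = Date.now();
crawlSequentially(['/a', '/b', '/c'], 50, async (url) => ({ url, at: Date.now() - started }))
    .then((visits) => console.log(visits));
```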

    oldentries: Integer

    The maximum number of seconds an entry can live without being updated. After each run, the app will search for old entries and delete them. If you do not wish to get rid of old entries, set this value to 0.
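
    The rule above can be sketched as a simple age comparison. This is an illustration under assumptions: `isOldEntry` is a hypothetical name, and the crawler may well express the same rule as a filter in its Algolia queries instead.

```javascript
// Hypothetical check: is a record older than `oldentries` seconds?
// A value of 0 disables the cleanup, as described above.
function isOldEntry(recordDate, oldentries, now = new Date()) {
    if (oldentries === 0) {
        return false;
    }
    const ageInSeconds = (now.getTime() - new Date(recordDate).getTime()) / 1000;
    return ageInSeconds > oldentries;
}

const now = new Date('2017-06-02T00:00:00Z');
console.log(isOldEntry('2017-06-01T00:00:00Z', 3600, now)); // true: one day old, limit is one hour
console.log(isOldEntry('2017-06-01T23:59:30Z', 3600, now)); // false: 30 seconds old
console.log(isOldEntry('2017-01-01T00:00:00Z', 0, now));    // false: cleanup disabled
```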

    maxRecordSize: Integer

    The maximum size in bytes of a record to be sent to Algolia. The default is 10,000 but could vary based on different plans.
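
    To get a feel for how close a record is to the limit, its serialized size can be measured in bytes. The sketch below assumes the size is the UTF-8 byte length of the JSON-serialized record; how the crawler and Algolia actually measure a record may differ, and `fitsInRecordLimit` is a hypothetical name.

```javascript
// Hypothetical size check based on the UTF-8 byte length of the serialized record.
// Note that bytes are not characters: 'é' takes two bytes in UTF-8.
function fitsInRecordLimit(record, maxRecordSize) {
    return Buffer.byteLength(JSON.stringify(record), 'utf8') <= maxRecordSize;
}

const record = { title: 'Montréal', text: ['Bonjour'] };
console.log(Buffer.byteLength(JSON.stringify(record), 'utf8'));
console.log(fitsInRecordLimit(record, 10000));
console.log(fitsInRecordLimit(record, 10));
```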

    index: Object

    An object containing various values related to your index.

    index.name: String

    Your index name.

    index.settings: Object

    An object that will be passed as the argument to Algolia's Index#setSettings method.

    Please read Algolia's documentation on that subject. Any valid attribute documented for this method can be used.

    index.settings.attributesToIndex: Array

    An array of strings that defines which attributes are indexable, which means that full text search will be performed against them. For a complete list of possible attributes see the Stored Object section.

    index.settings.attributesForFaceting: Array

    An array of strings that defines which attributes are filterable, which means that you can use them to exclude some records from being returned. For a complete list of possible attributes see the Stored Object section.

    sitemaps: Array

    This array should contain a list of sitemap objects.

    A sitemap is a really simple object with two String properties: url and lang. The 'url' property is the exact URL of this sitemap. The 'lang' property should specify the main language used by the URLs found in the sitemap.

    http: Object

    An object containing different http options.

    http.auth: String

    The auth string, in node's username:password form. If you do not need auth, you still need to specify an empty String.
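
    Node's http.request() accepts an `auth` option in this same 'user:password' form and computes a Basic Authorization header from it. Whether the crawler passes `http.auth` straight through is an assumption; the sketch below only shows what the equivalent header looks like, using a hypothetical `basicAuthHeader` helper.

```javascript
// Builds the Basic Authorization header value for a 'username:password' string.
// Returns undefined for an empty string, i.e. when no auth is configured.
function basicAuthHeader(auth) {
    return auth ? 'Basic ' + Buffer.from(auth, 'utf8').toString('base64') : undefined;
}

console.log(basicAuthHeader('user:secret'));
console.log(basicAuthHeader(''));
```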

    selectors: Object

    An object containing CSS selectors used to find the content in the page's HTML.

    selectors.title: String

    CSS selector for the title of the page.

    selectors.description: String

    CSS selector for the description of the page.

    selectors.image: String

    CSS selector for the image of the page.

    selectors.text: String

    CSS selector for the text of the page.

    selectors[key]: String

    CSS selector for the "key" property. You can add custom keys as you wish.

    exclusions: Object

    An object containing CSS selectors to find elements that must not be indexed. Those CSS selectors are matched against each node and against all of its parents, to make sure that none of its parents are excluded.

    exclusions.text: String

    CSS selector of excluded elements for the text of the page.

    exclusions[key]: String

    CSS selector of excluded elements for "key" property. The key must match the one used in selectors[key].
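
    The parent check described for exclusions can be sketched as a walk up the ancestor chain. The example uses plain objects instead of a real DOM, and `matches` is a hypothetical predicate standing in for CSS selector matching, so this shows the idea rather than the crawler's implementation.

```javascript
// Minimal stand-in for DOM nodes: each node has a tag name and a parent (or null).
// A node is excluded when it, or any of its ancestors, matches the exclusion.
function isExcluded(node, matches) {
    for (let current = node; current; current = current.parent) {
        if (matches(current)) {
            return true;
        }
    }
    return false;
}

const body = { tag: 'body', parent: null };
const nav = { tag: 'nav', parent: body };
const linkInNav = { tag: 'a', parent: nav };
const paragraph = { tag: 'p', parent: body };

// Plays the role of exclusions.text = 'nav'.
const isNav = (node) => node.tag === 'nav';

console.log(isExcluded(linkInNav, isNav)); // true: its parent is a <nav>
console.log(isExcluded(paragraph, isNav)); // false: no excluded ancestor
```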

    formatters: Object

    An object containing formatter strings. Their values are removed from the original result obtained with the associated CSS selector.

    formatters.title: String,Array

    The string to remove from the title of the page. Can also be an array of strings.

    formatters[key]: String,Array

    The string to remove from the specified key. Can also be an array of strings.
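
    A formatter can be pictured as a plain string removal. In the sketch below, `applyFormatter` is a hypothetical helper, and removing every occurrence (rather than only the first) is an assumption.

```javascript
// Hypothetical formatter: removes every occurrence of each formatter string.
// Accepts either a single string or an array of strings.
function applyFormatter(value, formatter) {
    const needles = Array.isArray(formatter) ? formatter : [formatter];
    return needles.reduce((result, needle) => result.split(needle).join(''), value);
}

console.log(applyFormatter('About us | My website', ' | My website'));
console.log(applyFormatter('Home - My website - EN', [' - My website', ' - EN']));
```

    This is typically useful to strip a site-wide suffix from page titles before they are indexed.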

    types[key]: String

    The parse function used to format the value. Supported types are "integer", "float" and "json".
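
    The three supported types map naturally onto standard JavaScript parse functions. The mapping below is a sketch under that assumption; `parsers` and `parseValue` are hypothetical names, and the crawler's handling of invalid input is not covered.

```javascript
// Hypothetical mapping from a `types[key]` name to a parse function.
const parsers = {
    integer: (value) => parseInt(value, 10),
    float: (value) => parseFloat(value),
    json: (value) => JSON.parse(value)
};

function parseValue(value, type) {
    // Values with no declared type, or an unknown one, are left untouched.
    const known = Object.prototype.hasOwnProperty.call(parsers, type);
    return known ? parsers[type](value) : value;
}

console.log(parseValue('42', 'integer'));
console.log(parseValue('3.14', 'float'));
console.log(parseValue('{"tags":["a","b"]}', 'json'));
console.log(parseValue('plain text', undefined));
```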

    defaults[key]: String

    The default value inserted for the specified key. Will be set if the value is falsy.
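
    Because the test is on falsiness, an empty string, a missing key and the number 0 would all be replaced. The sketch below illustrates that behavior with a hypothetical `applyDefaults` helper; it is not taken from the crawler's source.

```javascript
// Hypothetical helper: fills in falsy values from a `defaults` object.
function applyDefaults(record, defaults) {
    const result = Object.assign({}, record);
    Object.keys(defaults).forEach((key) => {
        if (!result[key]) {
            result[key] = defaults[key];
        }
    });
    return result;
}

console.log(applyDefaults(
    { title: 'Home', image: '' },
    { image: '/img/default.png', price: '0' }
));
```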

    plugins: Array

    A list of JavaScript files that load custom code to run before the record is saved. The only requirement is to implement the following interface, where record is the object to be saved and data is the HTML.

    module.exports = (record, data) => {
        record.value_from_plugin = 'Yay!';
    };
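
    A slightly larger plugin might derive a field from the page's HTML. The sketch below assumes that `data` is the HTML as a string and that `record.url` is already set when the plugin runs; the file name, the `addCanonical` function and the `canonical` field are invented for this example.

```javascript
// my-plugin.js (hypothetical file name, to be listed in the `plugins` array)
// Stores the page's canonical URL on the record, falling back to record.url.
// The regex only handles <link> tags where rel comes before href.
function addCanonical(record, data) {
    const pattern = /<link[^>]+rel=["']canonical["'][^>]*href=["']([^"']+)["']/i;
    const match = pattern.exec(String(data));
    record.canonical = match ? match[1] : record.url;
}

module.exports = addCanonical;
```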

    blacklist: Array

    All URLs are checked against all items in the blacklist. If the complete URL or its path component is in the blacklist, it won't get indexed.
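
    The blacklist rule can be sketched with Node's built-in URL class. `isBlacklisted` is a hypothetical helper, and the use of exact matches (no prefixes or patterns) is an assumption based on the description above.

```javascript
// Hypothetical check: a URL is skipped when the complete URL or its path is blacklisted.
function isBlacklisted(url, blacklist) {
    const path = new URL(url).pathname;
    return blacklist.includes(url) || blacklist.includes(path);
}

const blacklist = ['/private', 'https://example.org/old-page'];

console.log(isBlacklisted('https://example.org/private', blacklist));     // true: path match
console.log(isBlacklisted('https://example.org/old-page', blacklist));    // true: complete URL match
console.log(isBlacklisted('https://example.org/private/doc', blacklist)); // false: exact matches only in this sketch
```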

    Stored Object

    The stored object on Algolia's server is as follows:

    {
        date: new Date(),
        url: 'http://...',
        objectID: shasum.digest('base64'),
        lang: sitemap.lang,
        title: '',
        description: '',
        image: '',
        text: ['...']
    }

    One thing to notice is that text is an array, since we tried to preserve the original text node -> actual value relationship. Algolia handles this just fine.
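
    The objectID line suggests a hash computed with Node's crypto module. The sketch below assumes a SHA-1 digest of the page URL; the source only shows shasum.digest('base64'), so both the algorithm and the hashed input are guesses, and `buildRecord` is a hypothetical name.

```javascript
const crypto = require('crypto');

// Assumption: the objectID is a SHA-1 digest of the page URL, encoded in base64.
function buildRecord(url, lang, now = new Date()) {
    const shasum = crypto.createHash('sha1');
    shasum.update(url, 'utf8');
    return {
        date: now,
        url: url,
        objectID: shasum.digest('base64'),
        lang: lang,
        title: '',
        description: '',
        image: '',
        text: []
    };
}

const record = buildRecord('https://example.org/about', 'en');
console.log(record.objectID);
```

    Under that assumption the objectID is stable for a given URL, so crawling the same page again would update the existing record instead of creating a duplicate.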

    Indexing

    Indexing is done automatically, at each run. To tweak how indexing works, please see the index.settings configuration option.

    License

    MIT
    Made with love in Montréal by Deux Huit Huit
    Copyright (c) 2014-2017


    version

    2.2.1
