site-mapper

sitemap.xml generation in node.js

This module is intended to be used as a dependency of a website-specific sitemap building project. Add the module to the "dependencies" section of that project's package.json file:

{
  "dependencies": {
    "site-mapper": ">= 1.1.0"
  }
}

There is nothing stopping you from installing it locally and editing the embedded configuration files, but that is not the intent.

npm install site-mapper

Create a directory to hold your site map generation configuration. This directory will hold all the files needed to tell site-mapper what to create.

Create a package.json file similar to the following:

{
  "author": {
    "name": "YOUR NAME HERE",
    "email": "YOUR EMAIL HERE"
  },
  "name": "my-website-site-maps",
  "description": "sitemap generation for mysite.com",
  "version": "0.0.1",
  "homepage": "",
  "keywords": [
    "sitemap"
  ],
  "dependencies": {
    "site-mapper": ">= 1.0.1"
  },
  "engines": {
    "node": "*"
  }
}

Create a directory called ./config. For each environment you will generate sitemaps for (you may have only one, production), create a JavaScript or CoffeeScript file in the config directory named for the environment:

./config/production.coffee

or

./config/production.js

If you want to use CoffeeScript for configuration, you will have to add the CoffeeScript module as a dependency:

{
  "dependencies": {
    "site-mapper": ">= 1.0.1",
    "coffee-script-redux": ">0"
  }
}

The configuration file can contain any of the following keys. The values below are defaults that will be used unless overridden in your configuration file.

config = {}
config.sources = {}
config.sitemaps = {}
config.defaultSitemapConfig = {
  targetDirectory: "#{process.cwd()}/tmp/sitemaps/#{config.env}"
  sitemapIndex: "sitemap.xml"
  sitemapRootUrl: "http://www.mysite.com"
  sitemapFileDirectory: "/sitemaps"
  maxUrlsPerFile: 50000
  urlBase: "http://www.mysite.com"
}
config.sitemapIndexHeader = '<?xml version="1.0" encoding="UTF-8"?><sitemapindex xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9 http://www.sitemaps.org/schemas/sitemap/0.9/siteindex.xsd" xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
config.sitemapHeader = '<?xml version="1.0" encoding="UTF-8"?><urlset xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:image="http://www.google.com/schemas/sitemap-image/1.1" xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9 http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd" xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:video="http://www.google.com/schemas/sitemap-video/1.1" xmlns:geo="http://www.google.com/geo/schemas/sitemap/1.0" xmlns:news="http://www.google.com/schemas/sitemap-news/0.9/">'
config.defaultUrlFormatter = (options) ->
  (href) ->
    if '/' == href
      options.urlBase
    else if href && href.length && href[0] == '/'
      "#{options.urlBase}#{href}"
    else if href && href.length && href.match(/^https?:\/\//)
      href
    else
      if href.length
        "#{options.urlBase}/#{href}"
      else
        options.urlBase
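
To make the branches of the default formatter easier to follow, here is a JavaScript rendering of the CoffeeScript above, as a sketch rather than the module's compiled source. The sample inputs, including the cdn.mysite.com URL, are made up for illustration.

```javascript
// JavaScript sketch of the CoffeeScript defaultUrlFormatter shown above.
// options.urlBase comes from the sitemap configuration.
function defaultUrlFormatter(options) {
  return function (href) {
    if (href === '/') {
      return options.urlBase;               // site root
    } else if (href && href.length && href[0] === '/') {
      return options.urlBase + href;        // root-relative path
    } else if (href && href.length && /^https?:\/\//.test(href)) {
      return href;                          // already an absolute url
    } else if (href.length) {
      return options.urlBase + '/' + href;  // relative path, no leading slash
    } else {
      return options.urlBase;               // empty string
    }
  };
}

const format = defaultUrlFormatter({ urlBase: 'http://www.mysite.com' });
console.log(format('/'));                        // http://www.mysite.com
console.log(format('/about'));                   // http://www.mysite.com/about
console.log(format('faq'));                      // http://www.mysite.com/faq
console.log(format('https://cdn.mysite.com/a')); // https://cdn.mysite.com/a
console.log(format(''));                         // http://www.mysite.com
```

Like the original, this sketch expects href to be a string; a null or undefined href reaches the final length check and throws.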

The sitemaps object contains named keys, each pointing at an object that defines a particular sitemap. A sitemap definition can contain (and override) any of the keys in the config.defaultSitemapConfig object. Each produced sitemap consists of a sitemap index XML file referencing one or more gzipped sitemap XML files, created from the urls produced by the config.sources objects.

The configuration allows one or more sitemaps to be defined. For example, you might configure one sitemap for the http version of a site and another for the https version, or one sitemap for the www subdomain and another for the foobar subdomain.

By default, all sources defined in config.sources are used to generate urls for all sitemaps. To use different sources for different sitemaps, provide a sources key in each sitemap configuration object, like one of the following:

# Specify which sources to include. All others are ignored 
sources:
  includes: ['source1', 'source2', ...]

or

# Specify which sources to exclude. All others are included 
sources:
  excludes: ['source1', 'source2', ...]
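
The effect of the includes and excludes keys can be illustrated with a small sketch. The helper name selectSourceNames and the source names are hypothetical, and this is not site-mapper's actual implementation; it only shows the selection rule described above.

```javascript
// Hypothetical helper: decide which named sources feed one sitemap.
// allSources plays the role of config.sources; sitemapConfig is one
// entry of config.sitemaps.
function selectSourceNames(allSources, sitemapConfig) {
  const names = Object.keys(allSources);
  const { includes, excludes } = sitemapConfig.sources || {};
  if (Array.isArray(includes)) {
    const wanted = new Set(includes);
    return names.filter((name) => wanted.has(name));    // only the listed sources
  }
  if (Array.isArray(excludes)) {
    const unwanted = new Set(excludes);
    return names.filter((name) => !unwanted.has(name)); // everything but the listed sources
  }
  return names; // no sources key: every source is used
}

const allSources = { staticUrls: {}, serviceUrls: {}, blogUrls: {} };
console.log(selectSourceNames(allSources, { sources: { includes: ['staticUrls'] } }));
// [ 'staticUrls' ]
console.log(selectSourceNames(allSources, { sources: { excludes: ['serviceUrls'] } }));
// [ 'staticUrls', 'blogUrls' ]
console.log(selectSourceNames(allSources, {}));
// [ 'staticUrls', 'serviceUrls', 'blogUrls' ]
```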

The sources object contains arbitrarily named keys pointing at functions that take a single sitemapConfig object and return an object with the following keys: type, options. The input parameter, sitemapConfig, is an object formed by merging the config.defaultSitemapConfig object with a specific sitemap configuration (more about this later).
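
As an illustration of that merge, the sketch below combines the documented defaults with one sitemap definition. It assumes a shallow merge via Object.assign; the text does not say whether site-mapper merges shallowly or deeply, and buildSitemapConfig is a hypothetical name.

```javascript
// Defaults copied from config.defaultSitemapConfig (targetDirectory omitted).
const defaultSitemapConfig = {
  sitemapIndex: 'sitemap.xml',
  sitemapRootUrl: 'http://www.mysite.com',
  sitemapFileDirectory: '/sitemaps',
  maxUrlsPerFile: 50000,
  urlBase: 'http://www.mysite.com'
};

// One named sitemap definition that overrides some of the defaults.
const sitemaps = {
  main: {
    sitemapRootUrl: 'http://staging.mysite.com',
    urlBase: 'http://staging.mysite.com',
    sitemapIndex: 'sitemap_index.xml'
  }
};

// Hypothetical helper: later arguments win, so sitemap-specific keys
// override the defaults and untouched defaults are kept.
function buildSitemapConfig(name) {
  return Object.assign({}, defaultSitemapConfig, sitemaps[name]);
}

const sitemapConfig = buildSitemapConfig('main');
console.log(sitemapConfig.urlBase);        // http://staging.mysite.com
console.log(sitemapConfig.maxUrlsPerFile); // 50000
```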

In the returned object, the type is either one of the built in Source types (see below) or a site specific class derived from the Source base class. See the test suite for examples of creating Source subclasses.

A minimal config might be:

{StaticSetSource, HttpSource} = require 'site-mapper'
appConfig =
  sitemaps:
    main:
      sitemapRootUrl: "http://staging.mysite.com"
      urlBase: "http://staging.mysite.com"
      sitemapIndex: "sitemap_index.xml"
      targetDirectory: "#{process.cwd()}/tmp/sitemaps/#{config.env}/http"
  sources:
    staticUrls:
      type: StaticSetSource
      options:
        channel: 'static'
        changefreq: 'weekly'
        priority: 1
        urls: [
          '/',
          '/about',
          '/faq',
          '/jobs'
        ]
    serviceUrls:
      type: HttpSource
      options:
        changefreq: 'weekly'
        priority: 0.8
        serviceUrl: "http://api.mysite.com/widgets"
        channelForUrl: (url) ->
          url.category
        bodyProcessor: (body) ->
          urls = JSON.parse(body)
          urls.map (url) ->
            {permalink: url.permalink, updatedAt: url.updated_at, category: url.category}
        urlFormatter: (url) ->
          "http://www.mysite.com/widgets/#{url.category}/#{url.permalink}"
 
module.exports = appConfig
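
To see what the three HttpSource callbacks in this config do to a service response, the following sketch runs JavaScript equivalents of them against a sample body. The widget records are invented for illustration, and the order in which site-mapper itself invokes the callbacks is not shown here.

```javascript
// Sample response body; these widget records are made up.
const sampleBody = JSON.stringify([
  { permalink: 'blue-widget', updated_at: '2014-01-15', category: 'gadgets' },
  { permalink: 'red-widget', updated_at: '2014-02-01', category: 'tools' }
]);

// JavaScript equivalents of the CoffeeScript callbacks in the config above.
const bodyProcessor = (body) =>
  JSON.parse(body).map((url) => ({
    permalink: url.permalink,
    updatedAt: url.updated_at,
    category: url.category
  }));
const channelForUrl = (url) => url.category;
const urlFormatter = (url) =>
  `http://www.mysite.com/widgets/${url.category}/${url.permalink}`;

for (const url of bodyProcessor(sampleBody)) {
  console.log(channelForUrl(url), urlFormatter(url));
}
// gadgets http://www.mysite.com/widgets/gadgets/blue-widget
// tools http://www.mysite.com/widgets/tools/red-widget
```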

Finally, putting it all together, you can generate the sitemaps as follows:

  1. Install all the dependencies: rm -rf node_modules; npm install
  2. Run the generator: NODE_ENV=staging ./node_modules/.bin/site-mapper

Below is a makefile that encapsulates the above recipe. Run it with:

make setup generate

The makefile itself (recipe lines must be indented with tabs):

usage :
    @echo ''
    @echo 'Core tasks                       : Description'
    @echo '--------------------             : -----------'
    @echo 'make setup                       : Install dependencies'
    @echo 'make generate                    : Generate the sitemaps'
    @echo ''
 
COFFEE=./node_modules/.bin/coffee
SITEMAPPER=./node_modules/.bin/site-mapper
NPM_ARGS=
NODE_ENV=staging
 
setup :
    @rm -rf node_modules
    @echo npm $(NPM_ARGS) install
    @npm $(NPM_ARGS) install
 
generate :
    @rm -rf tmp
    @NODE_ENV=$(NODE_ENV) $(SITEMAPPER)

The site-mapper module views the sitemap generation process as follows:

+--------+          +------------+        +--------------+       +---------+
| Source |          | SiteMapper |        | SitemapGroup |       | Sitemap |
|--------|produces  |------------|  adds  |--------------| adds  |---------|
|        +--------->|            |------->|              |------>|         |
|        |  urls    |            |  urls  |              | urls  |         |
+--------+          +------------+        +--------------+       +---------+

There are three included Source implementations:

  1. StaticSetSource - This source is configured with a static list of urls in a configuration file.
  2. CsvFileSource - Reads a list of urls from a CSV file.
  3. HttpSource - Reads urls from an HTTP service.

The urls produced by Source objects can be simple strings, or objects if finer-grained control of the sitemap.xml tag values is required.

Each url is associated with a channel. The channel can be set per source, derived from the url path or set explicitly if the url is an object. The channel is used to name individual sitemap files. The default behavior is to take the first path segment as the channel. Thus, the url

/stuff/path/to/stuff.html

would be assigned to the 'stuff' channel and would be put in sitemaps named stuff0.xml.gz, stuff1.xml.gz, etc.
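
The default channel rule can be sketched in a few lines. The function name channelForPath is hypothetical, the sketch handles url paths only (not absolute urls), and returning null for a path with no segments is an assumption, since the behavior for that case is not described here.

```javascript
// Hypothetical sketch of the default rule: the first path segment is the channel.
function channelForPath(urlPath) {
  const segments = urlPath.split('/').filter((segment) => segment.length > 0);
  // Assumption: no segment (for example '/') yields null in this sketch.
  return segments.length ? segments[0] : null;
}

console.log(channelForPath('/stuff/path/to/stuff.html')); // stuff
console.log(channelForPath('/about'));                    // about
```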

A sitemap group is created for each URL channel. As urls are added to their corresponding SitemapGroup, the group will create sequentially numbered Sitemap files, each containing a configurable number of urls, 50,000 by default. The name of the sitemap files is of the form

${CHANNEL}${SEQUENCE}.xml.gz
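
The grouping and numbering described above can be sketched as follows. This is an illustration of the naming scheme, not the SitemapGroup implementation; assignToFiles and the sample entries are hypothetical, and the example uses a limit of 2 urls per file so the rollover is visible.

```javascript
// Hypothetical sketch: assign urls to sequentially numbered files per channel.
// entries is a list of { channel, loc } objects.
function assignToFiles(entries, maxUrlsPerFile = 50000) {
  const seenPerChannel = new Map(); // channel -> urls added so far
  const files = new Map();          // file name -> urls in that file
  for (const { channel, loc } of entries) {
    const index = seenPerChannel.get(channel) || 0;
    seenPerChannel.set(channel, index + 1);
    // Sequence numbers start at 0 and advance every maxUrlsPerFile urls.
    const name = `${channel}${Math.floor(index / maxUrlsPerFile)}.xml.gz`;
    if (!files.has(name)) files.set(name, []);
    files.get(name).push(loc);
  }
  return files;
}

const files = assignToFiles([
  { channel: 'stuff', loc: '/stuff/a.html' },
  { channel: 'stuff', loc: '/stuff/b.html' },
  { channel: 'news', loc: '/news/x.html' },
  { channel: 'stuff', loc: '/stuff/c.html' }
], 2);
console.log([...files.keys()]);
// [ 'stuff0.xml.gz', 'news0.xml.gz', 'stuff1.xml.gz' ]
```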

site-mapper will create a sitemap index and as many sitemap files as are required. The sitemap files are gzipped. There is no way to turn gzipping off.
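
For orientation, here is a sketch of what a sitemap index and a gzipped sitemap file involve, using only Node's built-in zlib module. It rests on two assumptions: that each index entry's location is formed as sitemapRootUrl + sitemapFileDirectory + file name, and that a trimmed-down header is acceptable for illustration. buildSitemapIndex is a hypothetical name, not part of site-mapper's API.

```javascript
const zlib = require('zlib');

// Trimmed-down index header for illustration; the configured default carries
// additional schema attributes.
const sitemapIndexHeader =
  '<?xml version="1.0" encoding="UTF-8"?>' +
  '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">';

// Hypothetical helper. Assumption: loc = sitemapRootUrl + sitemapFileDirectory + name.
// No XML escaping is done, so the inputs must not contain &, < or >.
function buildSitemapIndex(config, fileNames) {
  const entries = fileNames.map(
    (name) =>
      `<sitemap><loc>${config.sitemapRootUrl}${config.sitemapFileDirectory}/${name}</loc></sitemap>`
  );
  return sitemapIndexHeader + entries.join('') + '</sitemapindex>';
}

const indexXml = buildSitemapIndex(
  { sitemapRootUrl: 'http://www.mysite.com', sitemapFileDirectory: '/sitemaps' },
  ['stuff0.xml.gz', 'stuff1.xml.gz']
);
console.log(indexXml);

// Gzip a minimal sitemap body and check that it round-trips.
const sitemapXml =
  '<?xml version="1.0" encoding="UTF-8"?>' +
  '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">' +
  '<url><loc>http://www.mysite.com/stuff/path/to/stuff.html</loc></url>' +
  '</urlset>';
const gzipped = zlib.gzipSync(sitemapXml);
console.log(zlib.gunzipSync(gzipped).toString('utf8') === sitemapXml); // true
```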

It is up to you how you expose the sitemaps generated by site-mapper to Google and other search engines. Each company does this differently, so site-mapper includes no default publishing mechanism.

Run the test suite with:

make test

This should install all dependencies and run the test suite using coffee.