Drupal JSON:API Extractor


This package is a Drupal json:api client library with one primary responsibility: to crawl a Drupal-produced json:api and save the resulting data as static json files, organized in a directory structure that makes the files easy to look up.

Why all the trouble? For Drupal sites with only hundreds or low thousands of pages (the majority), enabling the (now core) json:api module in conjunction with this library allows for fully static front ends. Exporting all of a site's data to static json files allows those files to be deployed, statically, along with the site's decoupled front end.

It also presents an opportunity to transform the standard json:api output to something a little more friendly for developers to work with. Ideally this library is used during the static generation process.

Getting started

Crawling all Drupal nodes of a given content type, along with each node's associated relationships (including paragraphs), is pretty easy.

const { Spider } = require('drupal-jsonapi-extractor')
 
const baseURL = 'https://example.org/jsonapi/'
const spider = new Spider({ baseURL })
 
spider.crawl('/node/blog')
// or to crawl every published node
spider.crawlNodes()

While the above Spider does crawl through an entire set of content types it does not actually do anything with the results. This is where we introduce the Extractor object.

const { Spider, Extractor } = require('drupal-jsonapi-extractor')
 
const baseURL = 'https://example.org/jsonapi/'
const spider = new Spider({ baseURL })
const extractor = new Extractor(spider, { location: './downloads' })
 
extractor.wipe().then(() => spider.crawl('/node/content-type'))

Note: The extractor has a helpful utility function wipe(), which returns a Promise that resolves once the target directory is completely empty.

The above code will output a new downloads directory with the structure:

downloads/
  _resources/
    node/
      blog/
        0ef56bbd-b2d6-475e-8b83-e1fa9bc1e7fb.json
    paragraph/
      hero/
        425a6dc1-5158-4f12-8d54-eb8a7af369f0.json
    taxonomy_term/
      tags/
        2d850e4b-9d2f-4b8f-b1e7-ad959de8b393.json
  _slugs/
    node/
      1.json
    blogs/
      my-first-blog-post.json

This structure is intended to serve static sites well by allowing lookup by the json:api globally unique id, as well as by the more traditional Drupal path (node/1) and by a node's alias "slug" (/blogs/my-first-blog-post).
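As a rough illustration of how a front end or build script might read these files, here is a minimal sketch. The helpers resourcePath, slugPath and readJSON are hypothetical and not part of this package; the sketch only assumes the directory layout shown above.

```javascript
const fs = require('fs')
const path = require('path')

// Hypothetical helpers: they only assume the directory layout shown above.

// Path to a resource file, e.g. resourcePath('./downloads', 'node', 'blog', uuid)
function resourcePath (root, entityType, bundle, uuid) {
  return path.join(root, '_resources', entityType, bundle, `${uuid}.json`)
}

// Path to a slug file for a Drupal path ('/node/1') or an alias
// ('/blogs/my-first-blog-post'). Leading and trailing slashes are ignored.
function slugPath (root, urlPath) {
  const trimmed = urlPath.replace(/^\/+|\/+$/g, '')
  return path.join(root, '_slugs', `${trimmed}.json`)
}

// Read and parse one of the saved json files.
function readJSON (file) {
  return JSON.parse(fs.readFileSync(file, 'utf8'))
}

// Usage, once a crawl has populated ./downloads:
// const post = readJSON(slugPath('./downloads', '/blogs/my-first-blog-post'))
```

Because every lookup is a plain file read, the same helpers work at build time in a static site generator without a running Drupal instance.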

By default the extractor saves the exact output of the json:api. However, when developing your decoupled front end you may prefer a slightly less verbose json schema. This package includes a transformer that makes it easy to "clean" the output:

const extractor = new Extractor(spider, {
  location: './downloads',
  clean: true
})

Sometimes it is nice to see the progress of the download process. This package includes a console logger as well.

const { Spider, Extractor, Logger } = require('drupal-jsonapi-extractor')
 
const baseURL = 'https://example.org/jsonapi/'
const spider = new Spider({ baseURL })
const extractor = new Extractor(spider, { location: './downloads' })
const logger = new Logger([spider, extractor])
 
spider.crawl('/node/content-type')

The logger in our example would print to the command line:

✔️  node: 1
✔️  taxonomy_term: 1
✔️  paragraph: 1
----------------------------
🎉   Crawl complete!
Errors.................0
node...................1
paragraph..............1
taxonomy_term..........1

Configuration options

Each of the provided classes has a number of configuration options.

Spider

You pass options as the first argument when instantiating a new Spider.

new Spider(options)

{
  // (required) Should include the /jsonapi/ segment
  baseURL: 'https://example.org/jsonapi/',
 
  // (optional) Instance of axios with baseURL already applied
  api: axios,
 
  // Quit the program on a crawl error
  terminateOnError: false,
 
  // The maximum number of concurrent api requests the spider can open.
  // If you get timeout errors from the api, reduce this number.
  maxConcurrent: 5,
 
  // (optional) Resource class configuration options
  resourceConfig: {
    // (optional) Array of regex that is used to determine which relationships should be crawled
    relationships: [
      // By default, only relationships that start with field_ are crawled
      new RegExp(/^field_/)
    ]
  }
}
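To make the relationships option more concrete, here is a minimal sketch of how an array of regular expressions can select which relationships to follow. The function relationshipsToCrawl and the sample data are hypothetical; this is not the package's internal code, only an illustration of the matching rule described in the comments above.

```javascript
// Hypothetical helper: returns the names of the relationships whose key
// matches at least one of the given patterns.
function relationshipsToCrawl (relationships, patterns = [/^field_/]) {
  return Object.keys(relationships).filter(name =>
    patterns.some(pattern => pattern.test(name))
  )
}

// Made-up "relationships" section of a json:api resource.
const relationships = {
  field_tags: { data: [] },
  field_hero: { data: null },
  uid: { data: null },
  node_type: { data: null }
}

console.log(relationshipsToCrawl(relationships))
// [ 'field_tags', 'field_hero' ]

console.log(relationshipsToCrawl(relationships, [/^field_/, /^uid$/]))
// [ 'field_tags', 'field_hero', 'uid' ]
```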

Extractor

You pass options as the second argument when instantiating a new Extractor.

const extractor = new Extractor(spider, options)
extractor.wipe().then(() => spider.crawlNodes())
// To limit the depth of a crawl, pass a max depth (rarely needed since the
// package handles recursive references)
extractor.wipe().then(() => spider.crawlNodes(5))

Note: above we use the utility method wipe(), which returns a Promise that resolves once the target directory is completely empty.
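For readers wondering how a crawl can cope with recursive references, the general technique is to remember which resources have already been visited. The sketch below shows that idea on a made-up in-memory graph; the crawl function, the graph and the choice to count the starting resource as depth 0 are all assumptions for illustration, not the package's actual implementation.

```javascript
// Hypothetical stand-in for an API: each id maps to the ids it references.
const graph = {
  a: ['b', 'c'],
  b: ['a'], // points back to "a": a recursive reference
  c: ['d'],
  d: []
}

// Visit every reachable resource once, skipping anything deeper than maxDepth.
function crawl (startId, maxDepth = Infinity) {
  const visited = new Set()
  const queue = [{ id: startId, depth: 0 }]
  while (queue.length > 0) {
    const { id, depth } = queue.shift()
    // Skip resources we have already seen or that lie beyond the depth limit.
    if (visited.has(id) || depth > maxDepth) continue
    visited.add(id)
    for (const next of graph[id] || []) {
      queue.push({ id: next, depth: depth + 1 })
    }
  }
  return [...visited]
}

console.log(crawl('a'))    // [ 'a', 'b', 'c', 'd' ]
console.log(crawl('a', 1)) // [ 'a', 'b', 'c' ]
```

The visited set is what keeps the a -> b -> a loop from running forever, which is why an explicit depth limit is rarely necessary.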

{
  // The location to save files (will create directories automatically)
  location: './',
 
  // Should the data be transformed or "cleaned" before being saved to disk?
  clean: false,
 
  // Sometimes it's helpful to see pretty-printed json, just flip this to true.
  pretty: false,
 
  // The function to pass each Resource through before saving it if clean is true
  // By default we use our own transform function, this function takes a number of
  // options itself, or you can choose to use your own callback altogether.
  transformer: transformer({
 
    // Array of regular expressions to keep or reject each key in the "attributes"
    // section of a json:api response. Matches are kept. These are the defaults.
    attributeFilters: [
      /^field_/, // Common field prefix
      /^(title|created|changed|langcode|body)$/, // Common for node entities
      /^(name|weight|description)$/, // Common for taxonomies
      /^(parent_type|parent_id)$/ // Common for paragraphs
    ],
    
    // Same functionality as the attributeFilters but applied to the
    // "relationships" section of the json:api response.
    relationshipFilters: [
      /^field_/ // Common field prefix
    ],
 
    // When a field's value is an object, certain properties of that object
    // can be removed. Matches are removed.
    fieldPropertyFilters: [
      /^links$/
    ],
 
    // A callback that is passed a Resource object and expected to return a
    // cleaned up "fields" object. The default applies the above filters, but
    // a custom callback could be used here.
    cleanFields: callback
  })
}
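The keep and remove semantics of these filters can be illustrated with a small sketch. The helpers keepMatching and dropMatching and the sample data are hypothetical; the package's own transformer may be implemented differently, so treat this only as a picture of what "matches are kept" and "matches are removed" mean.

```javascript
// Hypothetical sketch of regex-based cleaning; not the package's transformer.
const matchesAny = (patterns, key) => patterns.some(p => p.test(key))

// Keep only the keys that match at least one pattern
// (the behavior described for attributeFilters and relationshipFilters).
function keepMatching (obj, patterns) {
  return Object.fromEntries(
    Object.entries(obj).filter(([key]) => matchesAny(patterns, key))
  )
}

// Drop the keys that match at least one pattern
// (the behavior described for fieldPropertyFilters).
function dropMatching (obj, patterns) {
  return Object.fromEntries(
    Object.entries(obj).filter(([key]) => !matchesAny(patterns, key))
  )
}

// Made-up attributes of a node.
const attributes = { title: 'Hello', drupal_internal__nid: 1, field_subtitle: 'World' }
const kept = keepMatching(attributes, [/^field_/, /^(title|created|changed|langcode|body)$/])
console.log(kept) // { title: 'Hello', field_subtitle: 'World' }

// Made-up field value that contains an object.
const field = { data: { id: 'abc' }, links: { self: {} } }
const trimmed = dropMatching(field, [/^links$/])
console.log(trimmed) // { data: { id: 'abc' } }
```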

Internally this library represents every crawled response with a Resource object. If you choose to override the transformer callback, it will be given a Resource as an argument. You can read the source code for details on its functionality. If you want to change the configuration options of our transformer, you can customize it:

const { Spider, Extractor, transformer } = require('drupal-jsonapi-extractor')
 
const baseURL = 'https://example.org/jsonapi/'
const spider = new Spider({ baseURL })
const extractor = new Extractor(spider, {
  location: './downloads',
  clean: true,
  transformer: transformer({
    attributeFilters: [
      /^custom_attribute_to_keep$/
    ]
  })
})
 
spider.crawl('/node/content-type')

Logger

The logger, at the moment, is pretty simple with just one configuration option:

new Logger([...emitters], {
  // Set the verbosity of the logger:
  // 0 - Log nothing
  // 1 - (default) Show a simple tally of number of downloads and number of errors
  // 2 - Log each entity and error as it's downloading
  // 3 - Log every event being listened to by the logger
  verbosity: 1
})

To do

Currently there is effectively no test coverage, although test files for the classes have been written with an instantiation check in each.