dga-sync README

Sync datasets easily

The Australian government's website references a growing abundance of public, open government data resources: more than 3700 datasets at the time of writing. While the site provides an API to access a dataset in some cases, it doesn't always. For this reason and others, there are often advantages in downloading the data for local use or to re-package it. The dga-sync utility eases the task of synchronising that data to a local file system.

dga-sync uses the JSON metadata stored for each dataset to ensure that data files are downloaded only if they are newer than what was previously downloaded. A local copy of the metadata is also stored.
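
The freshness check can be pictured with a minimal sketch. It assumes each resource carries a timestamp and that the local metadata maps resource IDs to the timestamp recorded at the last download. The names resourcesToDownload, timestamp and previous are hypothetical and are not taken from dga-sync's source.

```javascript
// Hypothetical sketch: pick the resources that need downloading by comparing
// remote timestamps with those recorded after the previous sync.
function resourcesToDownload(remoteResources, previous) {
  return remoteResources.filter(function (resource) {
    var seen = previous[resource.id];
    // Download if never seen before, or if the remote copy is strictly newer.
    return !seen || Date.parse(resource.timestamp) > Date.parse(seen);
  });
}

// Assumed shape of the locally stored metadata (illustration only).
var previous = { 'barbeque.kmz': '2014-09-16T02:05:54.523Z' };
var remote = [
  { id: 'barbeque.kmz', timestamp: '2014-09-16T02:05:54.523Z' },
  { id: 'wms?request=GetCapabilities', timestamp: '2014-09-16T02:05:54.523Z' }
];

var pending = resourcesToDownload(remote, previous);
console.log(pending.map(function (r) { return r.id; }));
```

In this sketch an unchanged resource is skipped, while a resource that has never been downloaded is always fetched.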

Getting started

npm install dga-sync

Simple usage

For each dataset, there is a JSON metadata file (accessed from the JSON button on the web page) that leads to a URL of the following form:

Use the id parameter to identify the package to sync:

var sync = require('dga-sync');
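
The call itself might look like the following minimal sketch. It assumes the module exposes syncByPackageId with the signature given in the method description further down, and that dataDestination is set to 'data' to match the directory shown in the output. The wrapper function syncPublicBarbeques is hypothetical and exists only so the snippet can be run without the package installed.

```javascript
// Sketch only: `sync` is the object returned by require('dga-sync'),
// passed in as a parameter rather than required here.
function syncPublicBarbeques(sync, andThen) {
  var packageId = '23218e8f-babe-4e37-81d1-5424a4d1c568';
  // dataDestination: 'data' is an assumption chosen to match the
  // directory that appears in the console output.
  sync.syncByPackageId(packageId, { dataDestination: 'data' }, andThen);
}

// Typical use (requires the package to be installed):
// syncPublicBarbeques(require('dga-sync'), function (err) {
//   if (err) { console.error('sync failed:', err); }
// });
```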

This is what the console output looks like (actual output is colourised where supported):

fetching metadata for package ID: 23218e8f-babe-4e37-81d1-5424a4d1c568
found: "Public Barbeques"
reply lists 5 resources:
   barbeque.kmz "2014 Public Barbeques" @ 2014-09-16T02:05:54.523Z
   wfs?request=GetFeature&typeName=23218e8f_babe_4e37_81d1_5424a4d1c568&outputFormat=csv "Public Barbeques CSV" @ 2014-09-16T02:05:54.523Z
   wfs?request=GetFeature&typeName=23218e8f_babe_4e37_81d1_5424a4d1c568&outputFormat=json "Public Barbeques GeoJSON" @ 2014-09-16T02:05:54.523Z
   wms?request=GetCapabilities "Public Barbeques - Preview this Dataset (WMS)" @ 2014-09-16T02:05:54.523Z
   wfs?request=GetCapabilities "Public Barbeques Web Feature Service API Link" @ 2014-09-16T02:05:54.523Z
preparing to download barbeque.kmz
preparing to download wfs?request=GetFeature&typeName=23218e8f_babe_4e37_81d1_5424a4d1c568&outputFormat=csv
preparing to download wfs?request=GetFeature&typeName=23218e8f_babe_4e37_81d1_5424a4d1c568&outputFormat=json
preparing to download wms?request=GetCapabilities
preparing to download wfs?request=GetCapabilities
downloading completed
  .. moving data/._DGA_DOWNLOAD_barbeque.kmz to data/barbeque.kmz
  .. moving data/._DGA_DOWNLOAD_wfs?request=GetFeature&typeName=23218e8f_babe_4e37_81d1_5424a4d1c568&outputFormat=csv to data/wfs?request=GetFeature&typeName=23218e8f_babe_4e37_81d1_5424a4d1c568&outputFormat=csv
  .. moving data/._DGA_DOWNLOAD_wfs?request=GetFeature&typeName=23218e8f_babe_4e37_81d1_5424a4d1c568&outputFormat=json to data/wfs?request=GetFeature&typeName=23218e8f_babe_4e37_81d1_5424a4d1c568&outputFormat=json
  .. moving data/._DGA_DOWNLOAD_wms?request=GetCapabilities to data/wms?request=GetCapabilities
  .. moving data/._DGA_DOWNLOAD_wfs?request=GetCapabilities to data/wfs?request=GetCapabilities
writing download metadata to: data/._METADATA_.json

At this point, a directory called data under the current working directory will have been created and will contain the downloaded resources plus a metadata file created by dga-sync:

$ ls -lhA data
total 744K
-rw-r--r-- 1 sam sam  44K Sep 24 11:14 barbeque.kmz
-rw-r--r-- 1 sam sam 6.0K Sep 24 11:15 ._METADATA_.json
-rw-r--r-- 1 sam sam  72K Sep 24 11:15 wfs?request=GetCapabilities
-rw-r--r-- 1 sam sam  95K Sep 24 11:14 wfs?request=GetFeature&typeName=23218e8f_babe_4e37_81d1_5424a4d1c568&outputFormat=csv
-rw-r--r-- 1 sam sam 384K Sep 24 11:14 wfs?request=GetFeature&typeName=23218e8f_babe_4e37_81d1_5424a4d1c568&outputFormat=json
-rw-r--r-- 1 sam sam 139K Sep 24 11:14 wms?request=GetCapabilities

The metadata file ensures that the next time we check, only resources newer than those we already have are downloaded, saving bandwidth.

Limiting what gets downloaded

As you can see from above, all resources are downloaded by default. This can be changed by adding an idFilter regex option. So if we only want the KMZ files in our example:

    idFilter: /.*\.kmz$/,
    deleteUnlisted: true

The use of deleteUnlisted is optional - it tells dga-sync to delete previously downloaded files now excluded by the filter. The contents of data are now:

$ ls -lhA data
total 48K
-rw-r--r-- 1 sam sam  44K Sep 24 11:14 barbeque.kmz
-rw-r--r-- 1 sam sam 1.4K Sep 24 11:26 ._METADATA_.json
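
To see why only barbeque.kmz survives, the filter can be applied by hand to the resource IDs from the earlier console output. This is an illustrative sketch of the matching step only, not dga-sync's own code; the variable names kept and removed are hypothetical.

```javascript
// Canonical resource IDs as listed in the console output above.
var ids = [
  'barbeque.kmz',
  'wfs?request=GetFeature&typeName=23218e8f_babe_4e37_81d1_5424a4d1c568&outputFormat=csv',
  'wfs?request=GetFeature&typeName=23218e8f_babe_4e37_81d1_5424a4d1c568&outputFormat=json',
  'wms?request=GetCapabilities',
  'wfs?request=GetCapabilities'
];
var idFilter = /.*\.kmz$/;

// IDs matching the filter are the ones that would be synced.
var kept = ids.filter(function (id) { return idFilter.test(id); });
// With deleteUnlisted: true, files for the non-matching IDs would be
// treated as extraneous and removed from the destination directory.
var removed = ids.filter(function (id) { return !idFilter.test(id); });

console.log(kept);           // [ 'barbeque.kmz' ]
console.log(removed.length); // 4
```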


API

There is currently only one method:

syncByPackageId(packageId, options, andThen)

packageId - the ID of the package/dataset

options - an object with the following options:

  • idFieldName - specifies the field in a resource to use as the resource ID [default: 'url']

  • idCanonicaliser - a function that takes the resource ID (according to the idFieldName option) and creates a canonical ID for comparison in later sync operations [default: split the ID at '/'s and use the last part: this assumes that idFieldName is the default value of 'url']

  • idFilter - applied to the (canonicalised) resource ID to choose which resources will be synced [default: undefined - that is, accept all IDs]

  • dataDestination - the directory to store the downloaded resources in

  • deleteUnlisted - boolean: true means delete extraneous files in the destination directory that don't correspond to a resource ID in the filtered list [default: false]

andThen(err) - optional callback, where err is any error encountered that prevented successful completion
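
The documented default for idCanonicaliser (split at '/' and keep the last part) can be sketched as below, together with a hypothetical replacement that produces friendlier file names. The function names, the example.org URLs and the replacement rule are assumptions made for illustration; only the option names come from the list above.

```javascript
// Sketch of the documented default: split the ID at '/' and keep the last part.
function defaultCanonicaliser(id) {
  var parts = id.split('/');
  return parts[parts.length - 1];
}

// Hypothetical alternative: also replace characters that are awkward in
// file names ('?', '&', '=') with underscores.
function fileFriendlyCanonicaliser(id) {
  return defaultCanonicaliser(id).replace(/[?&=]/g, '_');
}

// Made-up URLs, used only to show the transformation.
console.log(defaultCanonicaliser('http://example.org/files/barbeque.kmz'));
console.log(fileFriendlyCanonicaliser('http://example.org/geoserver/wfs?request=GetCapabilities'));

// An options object combining the settings described above.
var options = {
  idFieldName: 'url',
  idCanonicaliser: fileFriendlyCanonicaliser,
  idFilter: /^wfs_/,          // applied to the canonicalised ID
  dataDestination: 'data',
  deleteUnlisted: false
};
```

Because idFilter is applied to the canonicalised ID, a custom canonicaliser changes what the filter has to match, as the /^wfs_/ pattern in this sketch shows.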


