dga-sync README
Sync data.gov.au datasets easily
The Australian government's data.gov.au website references a growing number of public, open government data resources: more than 3700 datasets at the time of writing. data.gov.au provides an API for accessing some datasets, but not all. For this reason and others, it is often advantageous to download the data for local use or to re-package it. The dga-sync utility eases the task of synchronising that data to a local file system.
dga-sync uses the JSON metadata stored on data.gov.au for each dataset to ensure that data files are only downloaded if they are newer than what has previously been downloaded. A local copy of the metadata is also stored.
Getting started
npm install dga-sync
Simple usage
For each data.gov.au dataset, the JSON metadata (accessed from the JSON button on the web page) is available at a URL of the following form:
http://data.gov.au/api/3/action/package_show?id=23218e8f-babe-4e37-81d1-5424a4d1c568
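If only the metadata URL is to hand, the package ID can be read from its id query parameter. The helper below is a minimal sketch and is not part of dga-sync: packageIdFromUrl is a hypothetical name, and it relies on the URL class that is built into Node.js.

```javascript
// Hypothetical helper, not part of dga-sync: read the `id` query parameter
// from a package_show URL using Node's built-in WHATWG URL class.
function packageIdFromUrl(metadataUrl) {
  // searchParams.get returns null when the parameter is absent.
  return new URL(metadataUrl).searchParams.get('id');
}

console.log(packageIdFromUrl(
  'http://data.gov.au/api/3/action/package_show?id=23218e8f-babe-4e37-81d1-5424a4d1c568'
));
```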
Use the id parameter to identify the package to sync:
var sync = require('dga-sync');
sync.syncByPackageId('23218e8f-babe-4e37-81d1-5424a4d1c568');
This is what the console output looks like (actual output is colourised where supported):
fetching metadata for package ID: 23218e8f-babe-4e37-81d1-5424a4d1c568
found: "Public Barbeques"
reply lists 5 resources:
barbeque.kmz "2014 Public Barbeques" @ 2014-09-16T02:05:54.523Z
wfs?request=GetFeature&typeName=23218e8f_babe_4e37_81d1_5424a4d1c568&outputFormat=csv "Public Barbeques CSV" @ 2014-09-16T02:05:54.523Z
wfs?request=GetFeature&typeName=23218e8f_babe_4e37_81d1_5424a4d1c568&outputFormat=json "Public Barbeques GeoJSON" @ 2014-09-16T02:05:54.523Z
wms?request=GetCapabilities "Public Barbeques - Preview this Dataset (WMS)" @ 2014-09-16T02:05:54.523Z
wfs?request=GetCapabilities "Public Barbeques Web Feature Service API Link" @ 2014-09-16T02:05:54.523Z
preparing to download barbeque.kmz
preparing to download wfs?request=GetFeature&typeName=23218e8f_babe_4e37_81d1_5424a4d1c568&outputFormat=csv
preparing to download wfs?request=GetFeature&typeName=23218e8f_babe_4e37_81d1_5424a4d1c568&outputFormat=json
preparing to download wms?request=GetCapabilities
preparing to download wfs?request=GetCapabilities
downloading completed
.. moving data/._DGA_DOWNLOAD_barbeque.kmz to data/barbeque.kmz
.. moving data/._DGA_DOWNLOAD_wfs?request=GetFeature&typeName=23218e8f_babe_4e37_81d1_5424a4d1c568&outputFormat=csv to data/wfs?request=GetFeature&typeName=23218e8f_babe_4e37_81d1_5424a4d1c568&outputFormat=csv
.. moving data/._DGA_DOWNLOAD_wfs?request=GetFeature&typeName=23218e8f_babe_4e37_81d1_5424a4d1c568&outputFormat=json to data/wfs?request=GetFeature&typeName=23218e8f_babe_4e37_81d1_5424a4d1c568&outputFormat=json
.. moving data/._DGA_DOWNLOAD_wms?request=GetCapabilities to data/wms?request=GetCapabilities
.. moving data/._DGA_DOWNLOAD_wfs?request=GetCapabilities to data/wfs?request=GetCapabilities
writing download metadata to: data/._METADATA_.json
At this point, a directory called data will have been created under the current working directory, containing the downloaded resources plus a metadata file created by dga-sync:
$ ls -lhA data
total 744K
-rw-r--r-- 1 sam sam 44K Sep 24 11:14 barbeque.kmz
-rw-r--r-- 1 sam sam 6.0K Sep 24 11:15 ._METADATA_.json
-rw-r--r-- 1 sam sam 72K Sep 24 11:15 wfs?request=GetCapabilities
-rw-r--r-- 1 sam sam 95K Sep 24 11:14 wfs?request=GetFeature&typeName=23218e8f_babe_4e37_81d1_5424a4d1c568&outputFormat=csv
-rw-r--r-- 1 sam sam 384K Sep 24 11:14 wfs?request=GetFeature&typeName=23218e8f_babe_4e37_81d1_5424a4d1c568&outputFormat=json
-rw-r--r-- 1 sam sam 139K Sep 24 11:14 wms?request=GetCapabilities
The metadata file ensures that the next sync downloads only resources that are newer than those already held, saving bandwidth.
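The "only if newer" check can be pictured roughly as follows. This is an illustrative sketch, not dga-sync's actual code: the function name, the shape of the resource objects, and the map of previously recorded timestamps are all assumptions made for the example, and the real layout of ._METADATA_.json may differ.

```javascript
// Hypothetical sketch of a freshness check; not dga-sync's implementation.
// `previous` maps a resource ID to the ISO timestamp recorded at the last sync
// (an assumed structure, not the real ._METADATA_.json format).
function resourcesToDownload(listed, previous) {
  return listed.filter(function (resource) {
    var known = Object.prototype.hasOwnProperty.call(previous, resource.id);
    // Download when the resource has never been seen,
    // or when its timestamp is later than the one recorded.
    return !known ||
      Date.parse(resource.modified) > Date.parse(previous[resource.id]);
  });
}

var previous = { 'barbeque.kmz': '2014-09-16T02:05:54.523Z' };
var listed = [
  { id: 'barbeque.kmz', modified: '2014-09-16T02:05:54.523Z' },
  { id: 'wms?request=GetCapabilities', modified: '2014-09-16T02:05:54.523Z' }
];

// barbeque.kmz is unchanged, so only the WMS resource is selected.
console.log(resourcesToDownload(listed, previous).map(function (r) {
  return r.id;
}));
```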
Limiting what gets downloaded
As shown above, all resources are downloaded by default. This can be changed by adding an idFilter regex option. So if we only want the KMZ files in our example:
sync.syncByPackageId(
  '23218e8f-babe-4e37-81d1-5424a4d1c568',
  {
    idFilter: /.*\.kmz$/,
    deleteUnlisted: true
  }
);
The use of deleteUnlisted is optional: it tells dga-sync to delete previously downloaded files that are now excluded by the filter. The contents of data are now:
$ ls -lhA data
total 48K
-rw-r--r-- 1 sam sam 44K Sep 24 11:14 barbeque.kmz
-rw-r--r-- 1 sam sam 1.4K Sep 24 11:26 ._METADATA_.json
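How idFilter and deleteUnlisted interact can be sketched as below. This is a simplified illustration under stated assumptions, not the library's own code: planSync is a hypothetical function, and the list of existing files is assumed to exclude the ._METADATA_.json file, which dga-sync keeps.

```javascript
// Hypothetical sketch of the selection logic; not dga-sync's implementation.
function planSync(canonicalIds, existingFiles, options) {
  var idFilter = options.idFilter;
  // With no filter, every canonical ID is accepted.
  var wanted = canonicalIds.filter(function (id) {
    return idFilter === undefined || idFilter.test(id);
  });
  // With deleteUnlisted, local files that match no wanted ID are removed.
  var toDelete = options.deleteUnlisted
    ? existingFiles.filter(function (file) {
        return wanted.indexOf(file) === -1;
      })
    : [];
  return { wanted: wanted, toDelete: toDelete };
}

var ids = [
  'barbeque.kmz',
  'wms?request=GetCapabilities',
  'wfs?request=GetCapabilities'
];

// Files from the earlier full sync (metadata file assumed to be excluded).
var plan = planSync(ids, ids, { idFilter: /.*\.kmz$/, deleteUnlisted: true });
console.log(plan);
```

Without deleteUnlisted, the same filter would leave the earlier downloads in place and simply stop refreshing them.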
API
There is currently only one method:
syncByPackageId(packageId, options, andThen)
- packageId - the ID of the package/dataset
- options - an object with the following options:
  - idFieldName - specifies the field in a resource to use as the resource ID [default: 'url']
  - idCanonicaliser - a function that takes the resource ID (according to the idFieldName option) and creates a canonical ID for comparison in later sync operations [default: split the ID at '/'s and use the last part; this assumes that idFieldName is the default value of 'url']
  - idFilter - applied to the (canonicalised) resource ID to choose which resources will be synced [default: undefined, that is, accept all IDs]
  - dataDestination - the directory to store the downloaded resources in
  - deleteUnlisted - boolean: true means delete extraneous files in the destination directory that don't correspond to a resource ID in the filtered list [default: false]
- andThen(err) - optional callback, where err is any error encountered that prevented successful completion
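To make the idCanonicaliser option more concrete, the sketch below re-creates the documented default (split at '/' and keep the last part) and shows one possible custom alternative. Both functions are written for this example rather than taken from dga-sync, the example.com URL is made up, and the dataDestination value simply mirrors the data directory seen in the examples above.

```javascript
// A sketch of the documented default behaviour:
// split the ID at '/' and keep the last part.
function lastPathSegment(id) {
  var parts = id.split('/');
  return parts[parts.length - 1];
}

// Hypothetical custom canonicaliser: same as the default, but replaces
// '?', '&' and '=' with '_' to give file-system-friendlier names.
function fileSafeId(id) {
  return lastPathSegment(id).replace(/[?&=]/g, '_');
}

// An options object using the names listed above; the values are illustrative.
var options = {
  idFieldName: 'url',
  idCanonicaliser: fileSafeId,
  idFilter: /\.kmz$/,
  dataDestination: 'data',
  deleteUnlisted: false
};

// Made-up resource URL, used only to show the two canonical forms.
var exampleUrl = 'http://example.com/geoserver/wfs?request=GetCapabilities';
console.log(lastPathSegment(exampleUrl));
console.log(options.idCanonicaliser(exampleUrl));
```

An object like this could be passed as the second argument to syncByPackageId, with a callback as the third argument to learn whether the sync completed.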