dpm2

Like npm but for data packages!

Usage:

## CLI

$ dpm --help
Usage: dpm <command> [options] where command is:
  - cat       <datapackage name>[@<version>]
  - get       <datapackage name>[@<version>] [-f, --force] [-c, --cache]
  - clone     <datapackage name>[@<version>] [-f, --force]
  - install   <datapackage name 1>[@<version>] <datapackage name 2>[@<version>] ... [-c, --cache] [-s, --save] [-f, --force]
  - publish
  - unpublish <datapackage name>[@<version>]
  - adduser
  - owner <subcommand> where subcommand is:
    - ls  <datapackage name>
    - add <user> <datapackage name>
    - rm  <user> <datapackage name>[@<version>]
  - search [search terms]

Given a data package:

$ cat package.json

{
  "name": "mydpkg",
  "description": "my datapackage",
  "version": "0.0.0",
  "keywords": ["test", "datapackage"],

  "resources": [
    {
      "name": "inline",
      "schema": { "fields": [ {"name": "a", "type": "string"}, {"name": "b", "type": "integer"}, {"name": "c", "type": "number"} ] },
      "data": [ {"a": "a", "b": 1, "c": 1.2}, {"a": "x", "b": 2, "c": 2.3}, {"a": "y", "b": 3, "c": 3.4} ]
    },
    {
      "name": "csv1",
      "format": "csv",
      "schema": { "fields": [ {"name": "a", "type": "integer"}, {"name": "b", "type": "integer"} ] },
      "path": "x1.csv"
    },
    {
      "name": "csv2",
      "format": "csv",
      "schema": { "fields": [ {"name": "c", "type": "integer"}, {"name": "d", "type": "integer"} ] },
      "path": "x2.csv"
    }
  ]
}

stored on disk as:

$ tree
.
├── package.json
├── scripts
│   └── test.r
├── x1.csv
└── x2.csv

we can:

$ dpm publish
dpm http PUT http://registry.standardanalytics.io/mydpkg/0.0.0
dpm http 201 http://registry.standardanalytics.io/mydpkg/0.0.0
+ mydpkg@0.0.0

and reclone it:

$ dpm clone mydpkg
dpm http GET http://registry.standardanalytics.io/mydpkg?clone=true
dpm http 200 http://registry.standardanalytics.io/mydpkg?clone=true
dpm http GET http://registry.standardanalytics.io/mydpkg/0.0.0/debug
dpm http 200 http://registry.standardanalytics.io/mydpkg/0.0.0/debug
dpm http GET http://registry.standardanalytics.io/mydpkg/0.0.0/csv1
dpm http GET http://registry.standardanalytics.io/mydpkg/0.0.0/csv2
dpm http 200 http://registry.standardanalytics.io/mydpkg/0.0.0/csv1
dpm http 200 http://registry.standardanalytics.io/mydpkg/0.0.0/csv2
.
└─┬ mydpkg
  ├── package.json
  ├─┬ scripts
  │ └── test.r
  ├── x1.csv
  └── x2.csv
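In the clone log above, only csv1 and csv2 are fetched as separate requests, while the inline resource travels inside package.json. The following minimal sketch shows how one might tell the two kinds of resources apart. The helper `describeResources` is hypothetical and not part of dpm; it only inspects the `data`, `path` and `url` properties used in this README.

```javascript
// Hypothetical helper (not part of dpm): report where each resource's data lives.
function describeResources(dpkg) {
  return (dpkg.resources || []).map(function (r) {
    var source = 'unknown';
    if ('data' in r) { source = 'inline'; }
    else if ('path' in r) { source = 'path'; }
    else if ('url' in r) { source = 'url'; }
    var fields = (r.schema && r.schema.fields) || [];
    return {
      name: r.name,
      source: source,
      fields: fields.map(function (f) { return f.name; })
    };
  });
}

// Same data package as the package.json shown earlier.
var dpkg = {
  name: 'mydpkg',
  version: '0.0.0',
  resources: [
    {
      name: 'inline',
      schema: { fields: [ {name: 'a', type: 'string'}, {name: 'b', type: 'integer'}, {name: 'c', type: 'number'} ] },
      data: [ {a: 'a', b: 1, c: 1.2}, {a: 'x', b: 2, c: 2.3}, {a: 'y', b: 3, c: 3.4} ]
    },
    {
      name: 'csv1',
      format: 'csv',
      schema: { fields: [ {name: 'a', type: 'integer'}, {name: 'b', type: 'integer'} ] },
      path: 'x1.csv'
    },
    {
      name: 'csv2',
      format: 'csv',
      schema: { fields: [ {name: 'c', type: 'integer'}, {name: 'd', type: 'integer'} ] },
      path: 'x2.csv'
    }
  ]
};

console.log(describeResources(dpkg));
```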

But to save space, or because you only need one resource, you can also ask for just a package.json in which all the resource data have been replaced by a URL.

$ dpm get mydpkg
dpm http GET http://registry.standardanalytics.io/mydpkg
dpm http 200 http://registry.standardanalytics.io/mydpkg
.
└─┬ mydpkg
  └── package.json

For instance, using jsontool:

$ cat mydpkg/package.json | json resources | json -c 'this.name === "csv1"' | json 0.url

returns:

http://registry.standardanalytics.io/mydpkg/0.0.0/csv1

Then you can consume the resources you want with the module data-streams.
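The jsontool lookup above can also be done in plain Node. This is a minimal sketch, not dpm's own API: `resourceUrl` is a hypothetical helper, and the `fetched` object stands in for the parsed content of mydpkg/package.json (the URLs follow the pattern shown in the registry logs of this README).

```javascript
// Hypothetical helper (not part of dpm): find a resource by name and return its url.
function resourceUrl(dpkg, name) {
  var matches = (dpkg.resources || []).filter(function (r) {
    return r.name === name;
  });
  return matches.length ? matches[0].url : undefined;
}

// Stand-in for JSON.parse(fs.readFileSync('mydpkg/package.json', 'utf8'))
// after running `dpm get mydpkg`.
var fetched = {
  name: 'mydpkg',
  version: '0.0.0',
  resources: [
    { name: 'inline', url: 'http://registry.standardanalytics.io/mydpkg/0.0.0/inline' },
    { name: 'csv1', url: 'http://registry.standardanalytics.io/mydpkg/0.0.0/csv1' },
    { name: 'csv2', url: 'http://registry.standardanalytics.io/mydpkg/0.0.0/csv2' }
  ]
};

console.log(resourceUrl(fetched, 'csv1'));
// http://registry.standardanalytics.io/mydpkg/0.0.0/csv1
```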

Conversely, you can also cache all the resource data (including external URLs) in a standard directory structure, available for all the data packages stored on the registry.

$ dpm get mydpkg --cache
dpm http GET http://registry.standardanalytics.io/mydpkg
dpm http 200 http://registry.standardanalytics.io/mydpkg
dpm http GET http://registry.standardanalytics.io/mydpkg/0.0.0/inline
dpm http GET http://registry.standardanalytics.io/mydpkg/0.0.0/csv2
dpm http GET http://registry.standardanalytics.io/mydpkg/0.0.0/csv1
dpm http 200 http://registry.standardanalytics.io/mydpkg/0.0.0/inline
dpm http 200 http://registry.standardanalytics.io/mydpkg/0.0.0/csv1
dpm http 200 http://registry.standardanalytics.io/mydpkg/0.0.0/csv2
.
└─┬ mydpkg
  ├── package.json
  └─┬ data
    ├── inline.json
    ├── csv1.csv
    └── csv2.csv

Each resource in package.json now has a path property. For instance:

$ cat mydpkg/package.json | json resources | json -c 'this.name === "csv1"' | json 0.path

returns

data/csv1.csv
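Once a resource has a path, its schema can be used to turn the cached file into typed rows. The sketch below is only an illustration under several assumptions: `resolveResourcePath` and `parseCsv` are hypothetical helpers (not part of dpm or data-streams), the CSV is assumed to have a header row and no quoted fields, and the sample rows are made up because the README does not show the content of the CSV files.

```javascript
var path = require('path');

// Hypothetical helper: locate a cached resource relative to the package directory.
function resolveResourcePath(pkgRoot, resource) {
  return path.join(pkgRoot, resource.path);
}

// Hypothetical helper: naive CSV parsing (header row, no quoted fields),
// coercing each cell according to the type declared in the schema.
function parseCsv(text, schema) {
  var types = {};
  schema.fields.forEach(function (f) { types[f.name] = f.type; });

  var lines = text.trim().split(/\r?\n/);
  var header = lines[0].split(',');

  return lines.slice(1).map(function (line) {
    var cells = line.split(',');
    var row = {};
    header.forEach(function (col, i) {
      if (types[col] === 'integer') { row[col] = parseInt(cells[i], 10); }
      else if (types[col] === 'number') { row[col] = parseFloat(cells[i]); }
      else { row[col] = cells[i]; }
    });
    return row;
  });
}

// The csv1 resource as it appears after `dpm get mydpkg --cache`.
var csv1 = {
  name: 'csv1',
  format: 'csv',
  schema: { fields: [ {name: 'a', type: 'integer'}, {name: 'b', type: 'integer'} ] },
  path: 'data/csv1.csv'
};

console.log(resolveResourcePath('mydpkg', csv1));

// Made-up sample content; in practice read it with
// fs.readFileSync(resolveResourcePath('mydpkg', csv1), 'utf8').
var sample = 'a,b\n1,2\n3,4\n';
console.log(parseCsv(sample, csv1.schema));
```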

Given a package.json with

{
  "name": "test",
  "version": "0.0.0",
  "dataDependencies": {
    "mydpkg": "0.0.0"
  }
}

one can run

$ dpm install
dpm http GET http://registry.standardanalytics.io/versions/mydpkg
dpm http 200 http://registry.standardanalytics.io/versions/mydpkg
dpm http GET http://registry.standardanalytics.io/mydpkg/0.0.0
dpm http 200 http://registry.standardanalytics.io/mydpkg/0.0.0
.
├── data_modules
└─┬ mydpkg
  └── package.json

Combined with the --cache option, you get:

$ dpm install --cache
dpm http GET http://registry.standardanalytics.io/versions/mydpkg
dpm http 200 http://registry.standardanalytics.io/versions/mydpkg
dpm http GET http://registry.standardanalytics.io/mydpkg/0.0.0
dpm http 200 http://registry.standardanalytics.io/mydpkg/0.0.0
dpm http GET http://registry.standardanalytics.io/mydpkg/0.0.0/inline
dpm http GET http://registry.standardanalytics.io/mydpkg/0.0.0/csv2
dpm http GET http://registry.standardanalytics.io/mydpkg/0.0.0/csv1
dpm http 200 http://registry.standardanalytics.io/mydpkg/0.0.0/inline
dpm http 200 http://registry.standardanalytics.io/mydpkg/0.0.0/csv1
dpm http 200 http://registry.standardanalytics.io/mydpkg/0.0.0/csv2
.
├── data_modules
└─┬ mydpkg
  ├── package.json
  └─┬ data
    ├── inline.json
    ├── csv1.csv
    └── csv2.csv
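The dataDependencies map read by dpm install maps package names to versions, which lines up with the `<datapackage name>[@<version>]` form accepted on the command line. As a small sketch of that correspondence, the hypothetical helper `installSpecs` below (not part of dpm) turns the map into explicit arguments; whether the two invocations behave identically is an assumption.

```javascript
// Hypothetical helper (not part of dpm): turn dataDependencies into
// "<name>@<version>" arguments as accepted by `dpm install`.
function installSpecs(pkg) {
  var deps = pkg.dataDependencies || {};
  return Object.keys(deps).map(function (name) {
    return name + '@' + deps[name];
  });
}

// Same package.json as in the install example above.
var pkg = {
  name: 'test',
  version: '0.0.0',
  dataDependencies: {
    mydpkg: '0.0.0'
  }
};

console.log('dpm install ' + installSpecs(pkg).join(' '));
// dpm install mydpkg@0.0.0
```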

dpm aims to bring all the goodness of the npm workflow to your data needs. Run dpm --help to see the available options.

You can also use dpm programmatically.

var Dpm = require('dpm2');
var dpm = new Dpm(conf);

See bin/dpm for examples.

dpm uses the dataDependencies property of package.json and stores the dependencies in a data_modules/ directory, so it can safely be used as an npm post-install script without conflicting with npm.
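As an illustration of the post-install setup, here is a sketch of what a combined manifest might look like, written as a JavaScript object so it can be printed as JSON. The `postinstall` entry is npm's standard lifecycle script; running `dpm install` from it is an assumption based on the description above, and it presumes the dpm executable is available on the PATH.

```javascript
// Hypothetical project manifest: npm reads "scripts",
// while dpm reads "dataDependencies".
var manifest = {
  name: 'test',
  version: '0.0.0',
  dataDependencies: {
    mydpkg: '0.0.0'
  },
  scripts: {
    // npm runs this after `npm install` (assumes dpm is on the PATH).
    postinstall: 'dpm install'
  }
};

// Print the content one would save as package.json.
console.log(JSON.stringify(manifest, null, 2));
```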

Registry

By default, dpm uses our CouchDB-powered data registry hosted on Cloudant.

Why dpm2?

There is already a dpm being developed, but it leverages npm and the npm registry.

License

MIT