couchimport

Introduction

When populating CouchDB databases, often the source of the data is initially some JSON documents in a file, or some structured CSV/TSV data from another database's export.

couchimport is designed to assist with importing such data into CouchDB efficiently. Simply pipe a file full of JSON documents into couchimport, telling it the URL and database to send the data to and couchimport will batch the documents into batches and employ CouchDB's bulk import API.

Note: couchimport used to handle the CSV to JSON conversion, but this part is now handled by csvtojsonlines, keeping this package smaller and easier to maintain. The couchimport@1.6.5 package is the last version to support CSV/TSV natively - from 2.0 onwards, couchimport is only for pouring JSONL files into CouchDB.

Also note: the companion CSV export utility (couchexport) is now hosted at couchcsvexport.

Installation

Install using npm or another Node.js package manager:

npm install -g couchimport

Usage

couchimport can either read JSON docs (one per line) from stdin e.g.

cat myfile.json | couchimport

or by passing a filename as the last parameter:

couchimport myfile.json

couchimport's configuration parameters can be stored in environment variables or supplied as command line arguments.

Configuration - environment variables

Simply set the COUCH_URL environment variable e.g. for a hosted Cloudant/CouchDB database

export COUCH_URL="https://myusername:myPassw0rd@myhost.cloudant.com"

and define the name of the CouchDB database to write to by setting the COUCH_DATABASE environment variable e.g.

export COUCH_DATABASE="mydatabase"

Simply pipe the text data into "couchimport":

cat mydata.jsonl | couchimport

Configuring - command-line options

Supply the --url and --database parameters as command-line parameters instead:

couchimport --url "http://user:password@localhost:5984" --database "mydata" mydata.jsonl

or by piping data into stdin:

cat mydata.jsonl | couchimport --url "http://user:password@localhost:5984" --database "mydata"

A common usage pattern is to store the URL as an environment variable and to supply the database name as a command-line parameter e.g.

export COUCH_URL="http://user:password@localhost:5984"
cat myfile1.json | couchimport --db db1
cat myfile2.json | couchimport --db db2

Handling CSV/TSV data

We can use another package csvtojsonlines to convert CSV/TSV files into a JSONL stream acceptable to couchimport:

# CSV file ----> JSON lines ---> CouchDB
cat transactions.csv | csvtojsonlines --delimiter ',' | couchimport --db ledger

Generating random data

couchimport can be paired with datamaker to generate any amount of sample data:

# template ---> datamaker ---> 100 JSON docs ---> couchimport ---> CouchDB
echo '{"_id":"{{uuid}}","name":"{{name}}","email":"{{email true}}","dob":"{{date 1950-01-01}}"}' | datamaker -f json -i 100 | couchimport --db people
written {"docCount":100,"successCount":1,"failCount":0,"statusCodes":{"201":1}}
written {"batch":1,"batchSize":100,"docSuccessCount":100,"docFailCount":0,"statusCodes":{"201":1},"errors":{}}
Import complete

or with the template as a file:

cat template.json | datamaker -f json -i 10000 | couchimport --db people

Understanding errors

We know if we get an HTTP 4xx/5xx response, then all of the documents failed to be written to the database. But as couchimport is writing data in bulk, the bulk request may get an HTTP 201 response that doesn't mean that all of the documents were written. Some of the document ids may have been in the database already. So the couchimport output includes counts of the number of documents that were written successfully and the number that failed, and a tally of the HTTP response codes and individual document error messages:

e.g.

written {"batch":10,"batchSize":1,"docSuccessCount":4,"docFailCount":6,"statusCodes":{"201":10},"errors":{"conflict":6}}

The above log line shows that after the tenth batch of writes, we have written 4 documents and failed to write 6 others. There were six "conflict" errors, meaning that there was a clash of document id or id/rev combination.

Parallel writes

Older versions of couchimport supported the ability to have multiple HTTP requests in flight at any one time, but the new simplified couchimport does not. To achieve the same thing, simply split your file of JSON docs into smaller pieces and run multiple couchimport jobs:

# split large file into files 1m lines each
# this will create files xaa, xab, xac etc
split -l 1000000 massive.txt
# find all files starting with x and using xargs,
# spawn a max of 2 process at once running couchimport, 
# one for each file
find . -name "x*" | xargs -t -I % -P 2 couchimport --db test %

Environment variables reference

COUCH_URL - the url of the CouchDB instance (required, or to be supplied on the command line)
COUCH_DATABASE - the database to deal with (required, or to be supplied on the command line)
COUCH_BUFFER_SIZE - the number of records written to CouchDB per bulk write (defaults to 500, not required)
IAM_API_KEY - to use IBM IAM to do authentication, set the IAM_API_KEY to your api key and a bearer token will be used in the HTTP requests.

Command-line parameters reference

You can also configure couchimport using command-line parameters:

--help - show help
--url/-u - the url of the CouchDB instance (required, or to be supplied in the environment)
--database/--db/-d - the database to deal with (required, or to be supplied in the environment)
--buffer/-b - the number of records written to CouchDB per bulk write (defaults to 500, not required)

Using programmatically

In your project, add couchimport into the dependencies of your package.json or run npm install --save couchimport. In your code, require the library with

const couchimport = require('couchimport')

and your options are set in an object whose keys are the same as the command line parameters:

e.g.

const opts = { url: "http://localhost:5984", database: "mydb", rs: fs.createReadStream('myfile.json') }
await couchimport(opts)

Note: rs is the readstream where data will be read (default: stdin) and ws is the write stream where the output will be written (default: stdout)

couchimport

couchimport

Introduction

Installation

Usage

Configuration - environment variables

Configuring - command-line options

Handling CSV/TSV data

Generating random data

Understanding errors

Parallel writes

Environment variables reference

Command-line parameters reference

Using programmatically

Readme

Keywords

Package Sidebar

Install

Repository

Homepage

Weekly Downloads

Version

License

Unpacked Size

Total Files

Last publish

Collaborators

couchimport

couchimport

Introduction

Installation

Usage

Configuration - environment variables

Configuring - command-line options

Handling CSV/TSV data

Generating random data

Understanding errors

Parallel writes

Environment variables reference

Command-line parameters reference

Using programmatically

Readme

Keywords

Package Sidebar

Install

Repository

Homepage

DownloadsWeekly Downloads

Version

License

Unpacked Size

Total Files

Last publish

Collaborators

Weekly Downloads