confluence-mdk


This tool allows you to export the page structure and contents of Wiki pages from a Confluence space as RDF and upload the data along with a predefined ontology to a Neptune database. It offers a CLI as well as an API for node.js. It can be installed thru npm/yarn, dockerhub, or built from source.


Install from Docker Hub

Running this tool as a docker container is the simplest method for getting started.

Requirements:

  • Docker

Install:

$ docker pull openmbee/confluence-mdk:latest

Prepare: Create a file named .docker-env to store the configuration and user credentials that the tool will use to connect to the Confluence wiki (remove the export keywords shown in the example environment variables file below), then pass it into the docker run command like so:

$ docker run -it --init --rm --env-file .docker-env openmbee/confluence-mdk:latest export --help

The above shell command will print the help message for the export command.

The -it --init options allow you to interactively cancel and close the command from your terminal while it is running.

The --rm option will remove the stopped container from your file system once it exits.

The --env-file .docker-env option points docker to your environment variables file.
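
For reference, a minimal .docker-env might look like the following (placeholder values; the full set of variables is documented under Environment Variables below, written without the export keyword as required by --env-file):

CONFLUENCE_SERVER=https://wiki.xyz.org
CONFLUENCE_USER=user
CONFLUENCE_PASS=pass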

Install from NPM/Yarn

Requirements:

  • Node.js >= v14.13.0

If you are running on a personal machine and do not already have Node.js installed, webi is the recommended install method since it will automatically configure node and npm for you: https://webinstall.dev/node/

Install the package globally:

$ npm install -g confluence-mdk

Confirm the CLI is linked:

$ confluence-mdk --version

If the above works, congrats! You're good to go.

However, if you get an error, it is likely that npm has not yet been configured with a location for global packages.

For Linux and MacOS:

$ mkdir ~/.npm-global
$ echo -e "export NPM_CONFIG_PREFIX=~/.npm-global\nexport PATH=\$PATH:~/.npm-global/bin" >> ~/.bashrc
$ source ~/.bashrc

Install from source

This approach is for developers who wish to edit the source code for testing changes.

From the project's root directory:

$ npm install

To link the CLI, you can use:

$ npm link

If running on a personal machine, it is suggested to set your npm prefix so that the CLI is not linked globally.

CLI

The CLI has several commands, most having subcommands:

confluence-mdk <command>

Commands:
  confluence-mdk wiki <subcommand>     Manipulate the Confluence Wiki
  confluence-mdk s3 <subcommand>       Control a remote S3 Bucket
  confluence-mdk neptune <subcommand>  Control a remote AWS Neptune triplestore
  confluence-mdk import                Import an exported dataset into a Neptune database
                                       (composition of `s3` and `neptune` commands above)

Options:
  --version  Show version number                                       [boolean]
  --help     Show help                                                 [boolean]

Environment Variables

For local testing, it is recommended that you create a .env file with all the environment variables (Docker users can skip this step):

For Linux and MacOS:

#!/bin/bash
export CONFLUENCE_SERVER=https://wiki.xyz.org

###############################
# user/pass
export CONFLUENCE_USER=user
export CONFLUENCE_PASS=pass

# OR, using a personal access token
export CONFLUENCE_TOKEN=<yourPersonalAccessToken>
###############################

export NEPTUNE_S3_BUCKET_URL=s3://my-bucket
export NEPTUNE_S3_IAM_ROLE_ARN=arn:aws-us-gov:iam::123456784201:role/NeptuneLoadFromS3

export AWS_REGION=us-east-1
export AWS_ACCESS_KEY_ID=AKIAZH1AZYX1BABA1AB2
export AWS_SECRET_ACCESS_KEY=hoijAF/sEcRetAcc3SsKeYz/sjoAFNOJo18SOjos

export SPARQL_ENDPOINT=https://my-sparql-endpoint.us-east-1.neptune.amazonaws.com:8182
export SPARQL_PROXY=socks5://127.0.0.1:3032

Then, simply $ source .env before running the CLI.

For Windows, use set instead of export, for example:

set CONFLUENCE_SERVER=https://wiki.xyz.org

# user/pass
set CONFLUENCE_USER=user
set CONFLUENCE_PASS=pass

# OR, using a personal access token
set CONFLUENCE_TOKEN=<yourPersonalAccessToken>

CLI: wiki

Use confluence-mdk wiki --help for the latest documentation about this command's options.

CLI: wiki export

Export the contents of the given page (and optionally all of its descendants using the --recurse flag), as well as the wiki structure between them (i.e., the parent/child relationships).

Say we have a root wiki page at https://wiki.xyz.org/display/somespace/PageTitle on our server and we want to export it along with all of its descendants:

$ confluence-mdk wiki export https://wiki.xyz.org/display/somespace/PageTitle --recurse > wiki-export.ttl
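
The page argument should accept the same identifiers as the API's page option (a URL, a space/title pair, or a page ID; see ExportConfig below). As a sketch, assuming 12345 is the same page's ID and CONFLUENCE_SERVER is set in your environment:

$ confluence-mdk wiki export 12345 --recurse > wiki-export.ttl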

CLI: wiki child-pages

Print the target page's child pages as a line-delimited list of page IDs. Use the --json flag to print a JSON array instead, or the --urls flag to print URLs instead of page IDs.
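
For example, to list the same root page's children as URLs (a sketch using the flags described above, assuming CONFLUENCE_* credentials are set in your environment):

$ confluence-mdk wiki child-pages https://wiki.xyz.org/display/somespace/PageTitle --urls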

CLI: s3

Use confluence-mdk s3 --help for the latest documentation about this command's options.

This command provides some basic control over an S3 bucket for uploading RDF data from your local machine.

CLI: s3 upload-data

Uploads the Turtle file on stdin to the configured S3 bucket (overwriting the existing object).

Example:

$ confluence-mdk s3 upload-data  \
    --prefix="confluence/rdf/"  \
    --graph="https://wiki.xyz.org/display/somespace/MainPage"  \
    https://wiki.xyz.org/display/somespace/MainPage  < wiki-export.ttl

CLI: s3 upload-ontology

Uploads the static (prebuilt) ontology to the configured S3 bucket (overwriting the existing object).
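
For example, mirroring the upload-data invocation above (a sketch; --prefix is shared by the s3 subcommands, see the options list below):

$ confluence-mdk s3 upload-ontology --prefix="confluence/rdf/"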

CLI: neptune

Use confluence-mdk neptune --help for the latest documentation about this command's options.

This command provides some basic control over a Neptune instance for clearing a graph and then triggering Neptune's bulk loader on an S3 bucket.

CLI: neptune clear

Clear the given named graph.

Example:

$ confluence-mdk neptune clear --graph="https://wiki.xyz.org/display/somespace/MainPage"

CLI: neptune load

Bulk loads the ontology and data from S3 into the given named graph.

Example:

$ confluence-mdk neptune load --graph="https://wiki.xyz.org/display/somespace/MainPage" --bucket "s3://bucket-uri"

CLI: import

This is simply a convenience command which is equivalent to calling the following commands in order (passing in all relevant options such as --prefix and --graph):

  1. confluence-mdk s3 upload-data < {STDIN}
  2. confluence-mdk s3 upload-ontology
  3. confluence-mdk neptune clear
  4. confluence-mdk neptune load

Outputs are logged to stdout.
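
For example, a single invocation roughly equivalent to the four commands above (a sketch; assumes the AWS and Neptune environment variables below are set):

$ confluence-mdk import  \
    --prefix="confluence/rdf/"  \
    --graph="https://wiki.xyz.org/display/somespace/MainPage"  < wiki-export.ttl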

s3, neptune and import Options:

  • --prefix -- string to prepend to the S3 object keys, e.g., my-folder/
  • --graph -- IRI of the named graph to load all the RDF data into, e.g., https://wiki.xyz.org/display/Space+Rocks
  • --region -- AWS region of the S3 bucket and Neptune cluster (they must be in the same region). defaults to the AWS_REGION env var otherwise
  • --bucket -- the AWS s3://... bucket URI. defaults to the NEPTUNE_S3_BUCKET_URL env var otherwise
  • --sparql-endpoint -- the public URL of the SPARQL endpoint exposed by the Neptune cluster. defaults to the SPARQL_ENDPOINT env var otherwise
  • --neptune-s3-iam-role-arn -- the ARN of an IAM role to be assumed by the Neptune instance for access to the S3 bucket. defaults to the NEPTUNE_S3_IAM_ROLE_ARN env var otherwise

s3, neptune and import Environment variables:

  • NEPTUNE_REGION - the AWS region in which the Neptune cluster is located. deprecated; use AWS_REGION instead
  • AWS_REGION - the AWS region in which the Neptune cluster and S3 bucket are colocated
  • NEPTUNE_S3_BUCKET_URL - the s3://... bucket URL
  • NEPTUNE_S3_IAM_ROLE_ARN - the ARN associated with the Neptune cluster's role for loading data from S3
  • AWS_ACCESS_KEY_ID - AWS access key id
  • AWS_SECRET_ACCESS_KEY - AWS secret access key
  • SPARQL_ENDPOINT - the public URL to the SPARQL endpoint exposed by the Neptune cluster
  • SPARQL_PROXY - optional URL of a proxy used for sending requests to the SPARQL endpoint (requests must originate from a machine within the same VPC as the cluster; setting a proxy here allows you to send HTTP(S) requests through an SSH tunnel you open to an EC2 machine)
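
For example, the SPARQL_PROXY value shown earlier (socks5://127.0.0.1:3032) pairs naturally with a SOCKS tunnel opened over SSH; a sketch, where my-ec2-host is a hypothetical EC2 instance inside the cluster's VPC:

$ ssh -N -D 3032 ec2-user@my-ec2-host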

API: wikiExport

Fetch the metadata and contents of the given page (and, if recurse is set, all of its descendants), then produce an RDF representation of that information serialized as Turtle.

async function wikiExport(options: ExportConfig): Promise<void>

Example:

import fs from 'fs';
import {
  wikiExport,
} from 'confluence-mdk';

(async() => {
  await wikiExport({
    page: 'https://wiki.xyz.org/pages/viewpage.action?pageId=12345',
    user: process.env.CONFLUENCE_USER,
    pass: process.env.CONFLUENCE_PASS,
    output: fs.createWriteStream('./export.ttl'),
  });
})();

Or, if using commonjs:

const {
  wikiExport,
} = require('confluence-mdk');

API: wikiChildPages

Retrieve the page IDs for the child pages of the given Confluence page.

async function wikiChildPages(options: ExportConfig): Promise<string[]>
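
Example (a minimal sketch; the options follow ExportConfig below, and as_urls is optional):

import {
  wikiChildPages,
} from 'confluence-mdk';

(async() => {
  const pages = await wikiChildPages({
    page: 'https://wiki.xyz.org/display/somespace/PageTitle',
    user: process.env.CONFLUENCE_USER,
    pass: process.env.CONFLUENCE_PASS,
    as_urls: true,  // return URLs instead of page IDs
  });
  console.log(pages);
})();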

API: ExportConfig

is defined by the interface:

  • 'page': string - URI, space/title, or page id of the root page to export
  • 'server'?: string - optional URI origin of the Confluence server. can be omitted if a URI is passed to page
  • 'token'?: string - personal access token to use instead of user/pass. defaults to CONFLUENCE_TOKEN env var otherwise
  • 'user'?: string - username to use for basic auth. defaults to CONFLUENCE_USER env var otherwise
  • 'pass'?: string - password to use for basic auth. defaults to CONFLUENCE_PASS env var otherwise
  • 'output'?: stream.Writable - optional writable stream to output the RDF. defaults to stdout
  • 'recurse'?: boolean - whether or not to recursively export the children of this page. defaults to false
  • 'concurrency'?: number - optional maximum HTTP request concurrency to use when crawling
  • 'as_urls'?: boolean - only applies to wikiChildPages; returns child pages as URLs instead of page IDs

API: s3UploadData

Uploads the given Turtle input stream to the configured S3 bucket (overwriting the existing data.ttl object).

async function s3UploadData(options: ImportConfig): Promise<void>

API: s3UploadOntology

Uploads the given Turtle input stream to the configured S3 bucket (overwriting the existing ontology.ttl object).

async function s3UploadOntology(options: ImportConfig): Promise<void>

API: neptuneClear

Clears the given named graph on the Neptune database.

async function neptuneClear(options: ImportConfig): Promise<SPARQLUpdateResponseData>

API: neptuneLoad

Loads all objects with the given S3 prefix into the given named graph on the Neptune database.

async function neptuneLoad(options: ImportConfig): Promise<BulkLoadResult>
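
For comparison, a sketch of calling these four steps individually, which is what runImport (below) composes; the options follow ImportConfig:

import fs from 'fs';
import {
  s3UploadData,
  s3UploadOntology,
  neptuneClear,
  neptuneLoad,
} from 'confluence-mdk';

(async() => {
  const options = {
    prefix: 'confluence/rdf/',
    graph: 'https://wiki.xyz.org/display/somespace/MainPage',
    input: fs.createReadStream('./export.ttl'),
  };

  await s3UploadData(options);      // upload data.ttl to S3
  await s3UploadOntology(options);  // upload ontology.ttl to S3
  await neptuneClear(options);      // clear the named graph
  await neptuneLoad(options);       // bulk load from S3 into the named graph
})();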

API: runImport

Runs the above functions in order. All together this will upload the given Turtle input stream along with the fixed ontology to the configured S3 bucket (overwriting existing objects), clear the given named graph, then bulk load the data from S3 into the given named graph.

async function runImport(options: ImportConfig): Promise<ImportResults>

See ImportConfig below.

Where ImportResults will be an object with the following format:

  • 'clear' - demarshalled JSON response from issuing the SPARQL command that clears the triples in the named graph
  • 'load' - demarshalled JSON response from the bulk upload command that loads data into the named graph from the S3 bucket

Example:

import fs from 'fs';
import {
  runImport,
} from 'confluence-mdk';

(async() => {
  await runImport({
    prefix: 'confluence/rdf/',
    graph: 'https://wiki.xyz.org/display/wip/World+Domination',
    input: fs.createReadStream('./export.ttl'),
  });
})();

Or, if using commonjs:

const {
  runImport,
} = require('confluence-mdk');

API: ImportConfig

is defined by the interface:

  • 'prefix': string - S3 object key prefix, e.g., "confluence/rdf/". in this example, notice the trailing slash to specify a folder; you can specify the full object key instead, e.g., "confluence/rdf/data.ttl"
  • 'graph': string - IRI of the named graph to contain the triples, best practice is to use URI of the Wiki "space". this named graph will be cleared before being populated with the ontology and data
  • 'input'?: stream.Readable - optional readable stream to input the RDF data to be uploaded
  • 'region'?: string - AWS region of the S3 bucket and Neptune cluster (they must be in the same region). defaults to AWS_REGION env var otherwise
  • 'bucket'?: string - the AWS s3://... bucket URI. defaults to NEPTUNE_S3_BUCKET_URL env var otherwise
  • 'sparql_endpoint'?: string - the public URL to the SPARQL endpoint exposed by the Neptune cluster. defaults to SPARQL_ENDPOINT env var otherwise
  • 'neptune_s3_iam_role_arn'?: string - the ARN of an IAM role to be assumed by the Neptune instance for access to the S3 bucket. defaults to NEPTUNE_S3_IAM_ROLE_ARN env var otherwise
