gulp-etl-splitfile
Split a single Message Stream file into multiple files. This is ideal for chunking a stream into smaller pieces, either to keep file sizes manageable or to batch uploads to a database, or for "grouping" lines into files based on properties or values.
This is a gulp-etl plugin, and as such it is a gulp plugin. gulp-etl plugins process ndjson data streams/files, which we call Message Streams and which are compliant with the Singer specification. Message Streams look like this:
{"type": "SCHEMA", "stream": "users", "key_properties": ["id"], "schema": {"required": ["id"], "type": "object", "properties": {"id": {"type": "integer"}}}}
{"type": "RECORD", "stream": "users", "record": {"id": 1, "name": "Chris"}}
{"type": "RECORD", "stream": "users", "record": {"id": 2, "name": "Mike"}}
{"type": "SCHEMA", "stream": "locations", "key_properties": ["id"], "schema": {"required": ["id"], "type": "object", "properties": {"id": {"type": "integer"}}}}
{"type": "RECORD", "stream": "locations", "record": {"id": 1, "name": "Philadelphia"}}
{"type": "STATE", "value": {"users": 2, "locations": 1}}
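Since a Message Stream is ndjson (one JSON object per line), reading one comes down to splitting on newlines and parsing each line. The sketch below illustrates that idea only; the `Message` interface and `parseMessageStream` helper are hypothetical names for this example and are not part of the plugin.

```typescript
// Hypothetical minimal shape of a Singer-style message; only `type` is assumed.
interface Message {
  type: string;
  stream?: string;
  record?: Record<string, unknown>;
  [key: string]: unknown;
}

// Parse ndjson text into one message object per non-empty line.
function parseMessageStream(text: string): Message[] {
  return text
    .split(/\r?\n/)
    .filter((line) => line.trim().length > 0)
    .map((line) => JSON.parse(line) as Message);
}

const sampleStream = [
  '{"type": "RECORD", "stream": "users", "record": {"id": 1, "name": "Chris"}}',
  '{"type": "STATE", "value": {"users": 2, "locations": 1}}',
].join("\n");

const messages = parseMessageStream(sampleStream);
```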
Usage
const splitFile = require('gulp-etl-splitfile').splitFile; // javascript
import { splitFile } from 'gulp-etl-splitfile'; // typescript
gulp-etl plugins accept a configObj as their first parameter. The configObj contains any info the plugin needs.
Available configObj properties for this plugin:
index:number
- The maximum number of lines in each new file. Cannot be combined with groupBy.
// Split out a new file every 1000 lines
// cause error by using groupBy and index together
// default (no options): split out a new file for every line
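The effect of index can be pictured as cutting the list of lines into consecutive chunks, each of which becomes one output file. This is a minimal sketch of that behavior, not the plugin's implementation; `chunkLines` is a hypothetical helper, and the default of 1 follows the "new file for every line" default described above.

```typescript
// Split an array of lines into consecutive chunks of at most `index` lines.
// `chunkLines` is a hypothetical helper, not part of the plugin's API.
function chunkLines(lines: string[], index: number = 1): string[][] {
  if (!(index >= 1) || Math.floor(index) !== index) {
    throw new Error("index must be a positive integer");
  }
  const chunks: string[][] = [];
  for (let start = 0; start < lines.length; start += index) {
    chunks.push(lines.slice(start, start + index));
  }
  return chunks;
}

const fiveLines = ["a", "b", "c", "d", "e"];
const byTwo = chunkLines(fiveLines, 2); // [["a","b"],["c","d"],["e"]]
const byDefault = chunkLines(fiveLines); // one line per chunk
```

The last chunk may be shorter than index when the line count is not an exact multiple.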
groupBy:string|array
- Value(s) in lines to split lines between files; uses JSONSelect. Cannot be combined with index.
// group by (split lines to new files based on) the value of the "type" property of each line
// group by `type` and then `stream`
// group by `record.name` property
// group by `record.Last Name`, and/or by `type` (if it is equal to "STATE")
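To make the grouping idea concrete, here is a rough sketch that buckets lines by the values found at one or more property paths. It deliberately uses plain dot-separated paths as a stand-in for JSONSelect, so it does not reproduce the plugin's selector syntax; `getPath` and `groupLines` are hypothetical helpers, and joining values with "_" is an assumption made for the example.

```typescript
type Json = { [key: string]: unknown };

// Look up a dot-separated path such as "record.name"; undefined if missing.
function getPath(obj: Json, path: string): unknown {
  return path.split(".").reduce<unknown>(
    (current, part) =>
      typeof current === "object" && current !== null
        ? (current as Json)[part]
        : undefined,
    obj
  );
}

// Group ndjson lines by the values found at one or more paths.
// Lines where no path matches end up under the empty-string key.
function groupLines(lines: string[], groupBy: string | string[]): Record<string, string[]> {
  const paths = typeof groupBy === "string" ? [groupBy] : groupBy;
  const groups: Record<string, string[]> = {};
  for (const line of lines) {
    const message = JSON.parse(line) as Json;
    const key = paths
      .map((p) => getPath(message, p))
      .filter((v) => v !== undefined)
      .map((v) => String(v))
      .join("_");
    (groups[key] = groups[key] || []).push(line);
  }
  return groups;
}

const sampleLines = [
  '{"type": "RECORD", "stream": "users", "record": {"id": 1, "name": "Chris"}}',
  '{"type": "RECORD", "stream": "locations", "record": {"id": 1, "name": "Philadelphia"}}',
  '{"type": "STATE", "value": {"users": 2, "locations": 1}}',
];
const byType = groupLines(sampleLines, "type");
const byTypeAndStream = groupLines(sampleLines, ["type", "stream"]);
```

Each key of the returned object would correspond to one output file in this simplified picture.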
separator:string
- Character(s) to separate sections of file names
// splitting `file.ndjson`
// -> `file_0.ndjson`, `file_1.ndjson`... (this is the default)
// -> `file-0.ndjson`, `file-1.ndjson`...
// -> `file-SCHEMA.ndjson`, `file-RECORD.ndjson`...
// -> `file_SCHEMA_users.ndjson`, `file_RECORD_users.ndjson`...
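The naming pattern in these examples can be sketched as inserting the chunk number or group values between the base name and the extension, joined by the separator. `buildFileName` is a hypothetical helper written for illustration; the "_" default matches the default noted above, but the rest is an assumption about how the names are assembled.

```typescript
// Build an output name by inserting `parts` between the base name and extension.
// `buildFileName` is a hypothetical helper, not part of the plugin's API.
function buildFileName(
  original: string,
  parts: Array<string | number>,
  separator: string = "_"
): string {
  const dot = original.lastIndexOf(".");
  const base = dot > 0 ? original.slice(0, dot) : original;
  const ext = dot > 0 ? original.slice(dot) : "";
  return [base].concat(parts.map((p) => String(p))).join(separator) + ext;
}

const nameDefault = buildFileName("file.ndjson", [0]); // "file_0.ndjson"
const nameDash = buildFileName("file.ndjson", ["SCHEMA"], "-"); // "file-SCHEMA.ndjson"
const nameGrouped = buildFileName("file.ndjson", ["SCHEMA", "users"]); // "file_SCHEMA_users.ndjson"
```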
timeStamp:boolean
- Add a shortened string based on the current time to all file names. Use this to keep successive runs from overwriting the results of earlier runs.
// -> `file_l4514_fe_0.ndjson`, `file_l4514_fe_1.ndjson`...
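The exact format of the time-based string is not specified here. One common way to get a short, increasing string from the current time is to encode the millisecond timestamp in base 36; the sketch below shows that approach purely as an assumption, with `shortStamp` as a hypothetical helper.

```typescript
// Encode a date's millisecond timestamp in base 36 to get a short string.
// Assumption: this is one possible scheme, not necessarily the plugin's.
function shortStamp(date: Date): string {
  return date.getTime().toString(36);
}

// A fixed date keeps the example deterministic; real use would pass new Date().
const stamp = shortStamp(new Date(1296)); // 1296 = 36 * 36, so "100"
const stampedName = ["file", stamp, "0"].join("_") + ".ndjson"; // "file_100_0.ndjson"
```

Because later timestamps produce different strings, file names from successive runs would not collide under this scheme.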
Quick Start
- Dependencies:
- Clone this repo and run `npm install` to install npm packages
- Debug: with VS Code, use Open Folder to open the project folder, then hit F5 to debug. This runs without compiling to JavaScript, using ts-node
- Test: `npm test` or `npm t`
- Compile to JavaScript: `npm run build`
- Run using included test data (be sure to build first): `gulp --gulpfile debug/gulpfile.ts`
Testing
We use Jest for our testing. All of our tests are in the `test` folder.
- Run `npm test` to run the test suites. Note: tests are currently broken.
Notes
Note: This document is written in Markdown. We like to use Typora and Markdown Preview Plus for our Markdown work.