S3 Object Streams
A small Node.js package that can be helpful when performing operations on very large S3 buckets, meaning those containing millions of objects or more, or when processing S3 Inventory listings for such buckets. Streaming the listed contents keeps memory usage under control, and building on the Node.js streams API allows for fairly compact utility code.
For very large buckets, S3 Inventory is always a better choice than listing via the API.
Installing
Obtain via NPM:

```
npm install s3-object-streams
```
S3ListObjectStream
An object stream that pipes in configuration objects for listing the contents of an S3 bucket, and pipes out S3 object definitions.
```js
var AWS = require('aws-sdk');
var s3ObjectStreams = require('s3-object-streams');

var s3ListObjectStream = new s3ObjectStreams.S3ListObjectStream();
var s3Client = new AWS.S3();

// Log all of the listed objects.
s3ListObjectStream.on('data', function (s3Object) {
  console.info(s3Object);
});
s3ListObjectStream.on('end', function () {
  console.info('Listing complete.');
});
s3ListObjectStream.on('error', function (error) {
  console.error(error);
});

// List the contents of a couple of different buckets.
s3ListObjectStream.write({
  s3Client: s3Client,
  bucket: 'exampleBucket1',
  // Optional, only list keys with this prefix.
  prefix: 'examplePrefix'
});
s3ListObjectStream.write({
  s3Client: s3Client,
  bucket: 'exampleBucket2'
});
s3ListObjectStream.end();
```
Objects emitted by the stream have the standard format, with the addition of a `Bucket` property:

```js
{
  Bucket: 'exampleBucket1',
  Key: 'examplePrefix/file.txt',
  LastModified: Date,
  ETag: 'tag string',
  Size: 200,
  StorageClass: 'STANDARD',
  Owner: {
    DisplayName: 'exampleowner',
    ID: 'owner ID'
  }
}
```
S3ConcurrentListObjectStream
This works in the same way as the `S3ListObjectStream`, but under the hood it splits up the bucket by common prefixes and then recursively lists objects under each common prefix concurrently, up to the maximum specified concurrency.

```js
var AWS = require('aws-sdk');
var s3ObjectStreams = require('s3-object-streams');

var s3ConcurrentListObjectStream = new s3ObjectStreams.S3ConcurrentListObjectStream({
  // Optional, defaults to 15.
  maxConcurrency: 15
});
var s3Client = new AWS.S3();

// Log all of the listed objects.
s3ConcurrentListObjectStream.on('data', function (s3Object) {
  console.info(s3Object);
});
s3ConcurrentListObjectStream.on('end', function () {
  console.info('Listing complete.');
});
s3ConcurrentListObjectStream.on('error', function (error) {
  console.error(error);
});

// List the contents of a couple of different buckets.
s3ConcurrentListObjectStream.write({
  s3Client: s3Client,
  bucket: 'exampleBucket1'
});
s3ConcurrentListObjectStream.write({
  s3Client: s3Client,
  bucket: 'exampleBucket2'
});
s3ConcurrentListObjectStream.end();
```
Objects emitted by the stream have the standard format, with the addition of a `Bucket` property:

```js
{
  Bucket: 'exampleBucket1',
  Key: 'examplePrefix/file.txt',
  LastModified: Date,
  ETag: 'tag string',
  Size: 200,
  StorageClass: 'STANDARD',
  Owner: {
    DisplayName: 'exampleowner',
    ID: 'owner ID'
  }
}
```
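The common-prefix splitting described above is something S3 computes server-side: when a `Delimiter` is passed to `listObjects`, the response includes a `CommonPrefixes` list. As an illustration only, a hypothetical helper (not part of this package) can mimic that grouping locally:

```javascript
// Hypothetical helper, not part of s3-object-streams: mimics the grouping
// that S3 performs server-side when listObjects is called with a Delimiter.
// Each returned prefix could then be listed in its own concurrent request.
function commonPrefixes(keys, delimiter) {
  var prefixes = {};
  keys.forEach(function (key) {
    var index = key.indexOf(delimiter);
    // Keys without the delimiter have no common prefix at this level.
    if (index !== -1) {
      prefixes[key.slice(0, index + delimiter.length)] = true;
    }
  });
  return Object.keys(prefixes).sort();
}

console.log(commonPrefixes(
  ['logs/2015/a.txt', 'logs/2016/b.txt', 'images/c.png', 'readme.txt'],
  '/'
));
// [ 'images/', 'logs/' ]
```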
S3UsageStream
A stream for keeping a running total of count and size of listed S3 objects by bucket and key prefix. Useful for applications with a UI that needs to track progress.
```js
var AWS = require('aws-sdk');
var s3ObjectStreams = require('s3-object-streams');

var s3ListObjectStream = new s3ObjectStreams.S3ListObjectStream();
var s3UsageStream = new s3ObjectStreams.S3UsageStream({
  // Determine folders from keys with this delimiter.
  delimiter: '/',
  // Group one level deep into the folders.
  depth: 1,
  // Only send a running total once every 100 objects.
  outputFactor: 100
});
var s3Client = new AWS.S3();

s3ListObjectStream.pipe(s3UsageStream);

var runningTotals;

// Log all of the listed objects.
s3UsageStream.on('data', function (totals) {
  runningTotals = totals;
  console.info(JSON.stringify(runningTotals, null, '  '));
});
s3UsageStream.on('end', function () {
  console.info('Complete.');
});
s3UsageStream.on('error', function (error) {
  console.error(error);
});

// Obtain the total usage for these two buckets.
s3ListObjectStream.write({
  s3Client: s3Client,
  bucket: 'exampleBucket1'
});
s3ListObjectStream.write({
  s3Client: s3Client,
  bucket: 'exampleBucket2'
});
s3ListObjectStream.end();
```
The running total objects emitted by the stream have the following format:
```js
{
  path: 'exampleBucket/folder1',
  storageClass: {
    STANDARD: {
      // The number of files of this storage class.
      count: 55,
      // Total size in bytes of files in this storage class.
      size: 1232983
    },
    STANDARD_IA: {
      count: 0,
      size: 0
    },
    REDUCED_REDUNDANCY: {
      count: 2,
      size: 5638
    },
    GLACIER: {
      count: 0,
      size: 0
    }
  }
}
```
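Aggregating a running total entry is left to the caller. As a sketch, assuming an entry with the shape shown above, a hypothetical helper (not part of this package) might sum usage across storage classes like this:

```javascript
// Hypothetical helper, not part of s3-object-streams: sums the size fields
// across all storage classes in a single running total entry.
function totalSize(entry) {
  return Object.keys(entry.storageClass).reduce(function (sum, storageClass) {
    return sum + entry.storageClass[storageClass].size;
  }, 0);
}

// Example entry in the format emitted by the stream.
var entry = {
  path: 'exampleBucket/folder1',
  storageClass: {
    STANDARD: { count: 55, size: 1232983 },
    REDUCED_REDUNDANCY: { count: 2, size: 5638 }
  }
};

console.log(totalSize(entry));
// 1238621
```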
S3InventoryUsageStream
A stream for keeping a running total of count and size of S3 objects by bucket and key prefix, accepting objects from an S3 Inventory CSV file rather than from the `listObjects` API endpoint.
```js
// Core.
var fs = require('fs');
var zlib = require('zlib');

// NPM.
var csv = require('csv');
var _ = require('lodash');
var s3ObjectStreams = require('s3-object-streams');

// Assuming that we already have the manifest JSON and a gzipped CSV data file
// downloaded from S3:
var manifest = require('/path/to/manifest.json');
var readStream = fs.createReadStream('/path/to/inventory.csv.gz');

var s3InventoryUsageStream = new s3ObjectStreams.S3InventoryUsageStream({
  // Determine folders from keys with this delimiter.
  delimiter: '/',
  // Group one level deep into the folders.
  depth: 1,
  // Only send a running total once every 100 objects.
  outputFactor: 100
});

var runningTotals;

// Log all of the listed objects.
s3InventoryUsageStream.on('data', function (totals) {
  runningTotals = totals;
  console.info(JSON.stringify(runningTotals, null, '  '));
});

var complete = _.once(function (error) {
  if (error) {
    console.error(error);
  }
  else {
    console.info('Complete.');
  }
});

var gunzip = zlib.createGunzip();
var csvParser = csv.parse({
  columns: _.map(manifest.fileSchema.split(','), _.trim)
});

csvParser.on('error', complete);
gunzip.on('error', complete);
readStream.on('error', complete);
s3InventoryUsageStream.on('error', complete);
s3InventoryUsageStream.on('end', complete);

// Unzip the file on the fly, feed it to the csvParser, and then into the
// object stream.
readStream.pipe(gunzip).pipe(csvParser).pipe(s3InventoryUsageStream);
```
The running total objects emitted by the stream have the following format:
```js
{
  path: 'exampleBucket/folder1',
  storageClass: {
    STANDARD: {
      // The number of files of this storage class.
      count: 55,
      // Total size in bytes of files in this storage class.
      size: 1232983
    },
    STANDARD_IA: {
      count: 0,
      size: 0
    },
    REDUCED_REDUNDANCY: {
      count: 2,
      size: 5638
    },
    GLACIER: {
      count: 0,
      size: 0
    }
  }
}
```