AWS Data Science
A pragmatic take on being a data scientist for AWS-based applications and systems: take typical AWS data sources, apply transformations, and gather the results into reports.
While AWS does offer plenty of services for handling data at different scales, a data scientist often wants to crunch data on demand, get a feel for it, and answer questions right away before building bigger architectural systems.
Here is a glimpse of how it feels to use this library (via TypeScript):
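The sketch below is illustrative only: the import shape, constructor arguments, and generics are assumptions; the Origin, Transform, and Collect names are taken from the sections further down.

```ts
// Sketch only -- exact import shape and constructor signatures are assumptions.
import { Origin, Transform, Collect } from 'aws-data-science';

async function main() {
  const source = new Origin.Array([1, 2, 3, 4, 5]);               // readable stream of numbers
  const evens = new Transform.Filter((n: number) => n % 2 === 0); // keep even numbers only
  const sink = new Collect.Array<number>();                       // in-memory collector

  source.pipe(evens).pipe(sink);

  console.log(await sink.promise()); // e.g. [2, 4] once the stream has ended
}

main();
```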
As you might notice, these are functional building blocks implemented on top of the Node.js stream module with a charming API. When used via TypeScript, generics are leveraged to aid you when building your data pipelines (less debugging); this is optional for plain JS. Also, you never have to implement .on('end') event handlers when using Collectors, since they expose a .promise() which can be awaited.
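In other words, instead of wiring up an 'end' listener by hand, you can simply await the collector. A tiny sketch, under the same assumptions as above:

```ts
const sink = new Collect.Array<string>();
new Origin.Array(['a', 'b', 'c']).pipe(sink);

// No manual .on('end') handler needed -- the promise resolves when the stream finishes.
const items = await sink.promise();
```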
It is also quite easy to parallelize multiple pipelines: don't await just one, but put the promises of several pipelines into an array and await them all at once, for example with Promise.all(), as sketched below.
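A sketch of that pattern (class names as above, constructor arguments assumed):

```ts
const sinkA = new Collect.Array<number>();
const sinkB = new Collect.Array<number>();

new Origin.Array([1, 2, 3]).pipe(sinkA);
new Origin.Array([4, 5, 6]).pipe(sinkB);

// Both pipelines run concurrently; wait for both results at once.
const [a, b] = await Promise.all([sinkA.promise(), sinkB.promise()]);
```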
npm install -S aws-data-science
This package also requires a peer dependency of
Data Sources ("Origins")
All data sources (called "Origins") are readable streams and must be the starting point of all data analysis efforts. The following data sources can currently be used for data mining:
Origin.Array: start stream from simple arrays
Origin.String: start stream from string, emits words
Origin.CloudWatchLog: stream CloudWatchLog entries
- CloudFront Logs (via S3)
- CloudTrail Logs
- Billing API
- DynamoDB Tables
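A brief sketch of starting streams from the first origins; the options passed to Origin.CloudWatchLog are purely hypothetical and may not match the real constructor:

```ts
const numbers = new Origin.Array([1, 2, 3]);             // emits 1, 2, 3
const words = new Origin.String('the quick brown fox');  // emits individual words

// Hypothetical options object -- check the library for the actual signature.
const logEvents = new Origin.CloudWatchLog({ logGroupName: '/aws/lambda/my-function' });
```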
On every data stream, you can apply as many transformation steps as you wish. Since the stream pipe data flow model applies backpressure nicely for you, your computer should handle practically unbounded amounts of data without hassle.
Transform.Map: same as Array.prototype.map, applied to every item in the stream
Transform.Filter: same as Array.prototype.filter, applied to every item in the stream
Transform.ParseLambdaLog: unifies multi-line event outputs from Lambda
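Since transforms are ordinary stream transforms, they can be chained with pipe. A sketch, with the same assumed import shape as above:

```ts
new Origin.Array([1, 2, 3, 4])
  .pipe(new Transform.Filter((n: number) => n % 2 === 0)) // keep even numbers
  .pipe(new Transform.Map((n: number) => n * 10))         // then scale them
  .pipe(new Collect.Array<number>());                      // collects 20, 40
```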
This is where data mining comes into play. You can pipe your data stream into several "Aggregators" to generate additional data, for example counting even numbers in a number stream, or occurrences of words within a text corpus.
Aggregate.Count: count truthy statements in stream
Aggregate.List: store things from the stream in an array
Aggregate.Mean: count numbers from the stream and return the mean value
Aggregate.Rank: count occurrences of things and sort by highest count
Aggregate.Sum: add all numbers in a stream
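Counting even numbers in a number stream could look roughly like the sketch below; how the aggregated value is read back afterwards is not shown, since that part of the API is not documented here and would be an assumption.

```ts
const evenCount = new Aggregate.Count(); // counts truthy values flowing through

new Origin.Array([1, 2, 3, 4, 5, 6])
  .pipe(new Transform.Map((n: number) => n % 2 === 0)) // map each number to true/false
  .pipe(evenCount)
  .pipe(new Collect.Nothing()); // drain the stream; only the aggregate matters
```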
Once your data pipeline has done everything you want, you must choose where the data should end up. You might collect everything in an in-memory array, store it in files, or even discard it completely if you only need some aggregated information.
Collect.Array: stream sink as simple array
Collect.JsonFile: stream sink directly into a JSON array file
Collect.Nothing: when you don't need the data any longer
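A sketch of the file-based sink; the constructor argument (a file path) is an assumption:

```ts
const file = new Collect.JsonFile('./even-numbers.json'); // path argument assumed

new Origin.Array([1, 2, 3, 4])
  .pipe(new Transform.Filter((n: number) => n % 2 === 0))
  .pipe(file);

await file.promise(); // resolves once the JSON array has been written
```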