A standard/convention for running tasks over a list of files based around Node core streams2.
minitask is a library for processing tasks on files. It is used in many of my projects, such as
Most file processing tasks can be divided into three phases, and minitask provides tools for each phase:
1. Directory iteration: selecting a set of files to operate on, using the List class
2. Task definition:
   - defining operations on files using the Task class
   - making use of cached results using the Cache class
3. Task execution:
   - executing operations in parallel using the Runner class
   - storing cached results using the Cache class
Separating these into distinct phases has several advantages. The main advantage is that each of these operations can be written independently of the other two: e.g. no task definition during iteration and no execution parallelism concerns during task definition.
Further, separating task definition from execution allows for much greater execution parallelism compared to a naive sequential stream processing implementation. This means faster builds.
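To illustrate why queueing tasks first helps, here is a minimal sketch of a concurrency-limited runner. It is not minitask's Runner class: `runParallel` and its `limit` parameter are hypothetical names, and tasks are assumed to be plain functions that return a value or a Promise.

```javascript
// Hypothetical helper (not minitask's Runner): run already-defined tasks
// with at most `limit` of them in flight at any time.
function runParallel(tasks, limit) {
  const maxActive = Math.max(1, limit || 1);
  return new Promise((resolve, reject) => {
    const results = new Array(tasks.length);
    let next = 0;       // index of the next task to start
    let active = 0;     // number of tasks currently running
    let failed = false; // stop launching new tasks after the first error

    if (tasks.length === 0) {
      resolve(results);
      return;
    }

    function launch() {
      while (!failed && active < maxActive && next < tasks.length) {
        const index = next++;
        active++;
        Promise.resolve()
          .then(() => tasks[index]())
          .then((value) => {
            results[index] = value;
            active--;
            if (next === tasks.length && active === 0) {
              resolve(results);
            } else {
              launch();
            }
          })
          .catch((err) => {
            failed = true;
            reject(err);
          });
      }
    }

    launch();
  });
}

// Usage: the queue is built first, then executed two tasks at a time.
const delay = (ms, value) => () =>
  new Promise((done) => setTimeout(() => done(value), ms));

runParallel([delay(30, 'a'), delay(10, 'b'), delay(20, 'c')], 2)
  .then((results) => console.log(results)); // results are stored in queue order
```

Because the whole queue exists before anything runs, the runner alone decides how many tasks are in flight. The task definitions contain no scheduling logic.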
The List API essentially consists of:
- the `add` function, which adds path targets
- `find`, which selects files
- the `exec` function, which performs the actual traversal
A few notes:
`.filter` on the result)
`exec` function because this allows the same List object to be run multiple times against a changing directory structure, which is nice if you are running the same operations multiple times (e.g. in a server).
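The idea of deferring traversal until `exec` can be sketched with Node core modules only. The `FileList` class below is a hypothetical stand-in: its method signatures, including `find` taking a predicate, are assumptions for illustration and may differ from the real API in docs/list.md.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Hypothetical stand-in for a List-like class; not minitask's actual API.
class FileList {
  constructor() {
    this.targets = []; // paths registered with add()
    this.filters = []; // predicates registered with find()
  }

  // Register a file or directory to traverse later.
  add(target) {
    this.targets.push(target);
    return this;
  }

  // Register a predicate; only paths that pass every predicate are kept.
  find(predicate) {
    this.filters.push(predicate);
    return this;
  }

  // Walk the registered targets now, so each call sees the current tree.
  exec() {
    const files = [];
    const walk = (p) => {
      const stat = fs.statSync(p);
      if (stat.isDirectory()) {
        fs.readdirSync(p).sort().forEach((name) => walk(path.join(p, name)));
      } else if (stat.isFile() && this.filters.every((f) => f(p))) {
        files.push(p);
      }
    };
    this.targets.forEach(walk);
    return files;
  }
}

// Usage: the same list object is run twice against a changing directory.
const dir = fs.mkdtempSync(path.join(os.tmpdir(), 'list-demo-'));
fs.writeFileSync(path.join(dir, 'a.js'), '');
const list = new FileList().add(dir).find((p) => p.endsWith('.js'));
console.log(list.exec().length); // 1
fs.writeFileSync(path.join(dir, 'b.js'), '');
console.log(list.exec().length); // 2
```

Nothing touches the file system until `exec()` is called, which is what makes the list reusable.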
The list API is documented in docs/list.md.
The Task API provides a way to express a set of transformations using an array of:
without having to worry about the details of how these things are connected. Node's duplex streams are a bit tedious for simple transforms and Node's `child_process` returns something that's not quite a duplex stream. The Task API works around those limitations by providing some plumbing, and returns a queueable task object that can be run later.
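The kind of plumbing involved can be sketched as follows. This is not the Task class itself: `sync`, `proc`, and `defineTask` are hypothetical helpers built on Node's `stream` and `child_process` modules, and error handling is left out to keep the sketch short.

```javascript
const { Transform, PassThrough } = require('stream');
const { spawn } = require('child_process');

// Each helper returns a factory, so every run gets fresh streams.

// Wrap a synchronous string -> string function as a buffering Transform.
function sync(fn) {
  return () => {
    const chunks = [];
    const t = new Transform({
      transform(chunk, encoding, callback) {
        chunks.push(chunk);
        callback();
      },
      flush(callback) {
        let result;
        try {
          result = fn(Buffer.concat(chunks).toString());
        } catch (err) {
          callback(err);
          return;
        }
        this.push(result);
        callback();
      },
    });
    return { input: t, output: t };
  };
}

// A child process exposes separate stdin and stdout streams, not one duplex.
function proc(command, args) {
  return () => {
    const child = spawn(command, args, { stdio: ['pipe', 'pipe', 'inherit'] });
    return { input: child.stdin, output: child.stdout };
  };
}

// Hypothetical task definition: nothing runs until the returned function is called.
function defineTask(steps) {
  return function run(source, done) {
    let current = source;
    for (const makeStep of steps) {
      const step = makeStep();
      current.pipe(step.input);
      current = step.output;
    }
    const chunks = [];
    current.on('data', (chunk) => chunks.push(Buffer.from(chunk)));
    current.on('end', () => done(null, Buffer.concat(chunks).toString()));
  };
}

// Usage: define now, run later. The child process is a pass-through stand-in
// for a real command-line tool.
const task = defineTask([
  sync((text) => text.toUpperCase()),
  proc(process.execPath, ['-e', 'process.stdin.pipe(process.stdout)']),
  sync((text) => text.trim() + '\n'),
]);

const source = new PassThrough();
task(source, (err, output) => process.stdout.write(output));
source.end('  hello minitask  ');
```

Representing every step as an `{ input, output }` pair lets functions, streams, and child processes be chained with the same loop.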
A few notes:
The task API is documented in docs/task.md.
Tasks are often run multiple times without the underlying file changing, which means we can skip the work and use a cached version. The cache API handles:
The cache API supports storing result files and file metadata in a way that ensures that if the underlying file changes, the related cached data is invalidated. The input file can be checked using size + date modified, or by running a hash algorithm such as md5 on the file.
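The two invalidation strategies can be sketched with Node's `fs` and `crypto` modules. The `fingerprint` function and `ResultCache` class are hypothetical and keep everything in memory, whereas the real Cache class in docs/cache.md stores result files and metadata.

```javascript
const crypto = require('crypto');
const fs = require('fs');
const os = require('os');
const path = require('path');

// Fingerprint a file by size + modification time (cheap) or by an md5 hash
// of its contents (slower, but independent of timestamps).
function fingerprint(filePath, method) {
  if (method === 'md5') {
    return crypto.createHash('md5').update(fs.readFileSync(filePath)).digest('hex');
  }
  const stat = fs.statSync(filePath);
  return stat.size + '-' + stat.mtime.getTime();
}

// Hypothetical in-memory cache keyed by task name and input path.
class ResultCache {
  constructor(method) {
    this.method = method || 'stat';
    this.entries = new Map();
  }

  // Return the cached result, or undefined if the input file has changed.
  get(filePath, taskName) {
    const entry = this.entries.get(taskName + '\0' + filePath);
    if (!entry || entry.fingerprint !== fingerprint(filePath, this.method)) {
      return undefined;
    }
    return entry.result;
  }

  // Store a result together with the input file's current fingerprint.
  set(filePath, taskName, result) {
    this.entries.set(taskName + '\0' + filePath, {
      fingerprint: fingerprint(filePath, this.method),
      result: result,
    });
  }
}

// Usage: the entry becomes stale as soon as the input file changes.
const tmpDir = fs.mkdtempSync(path.join(os.tmpdir(), 'cache-demo-'));
const file = path.join(tmpDir, 'input.txt');
fs.writeFileSync(file, 'first');
const cache = new ResultCache('md5');
cache.set(file, 'uppercase', 'FIRST');
console.log(cache.get(file, 'uppercase')); // FIRST
fs.writeFileSync(file, 'second');
console.log(cache.get(file, 'uppercase')); // undefined
```

Size plus modification time avoids reading the file at all. Hashing costs a full read but still works when timestamps are unreliable, for example after a fresh checkout.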
A few notes:
The cache API is documented in docs/cache.md.
The runner API is documented in docs/runner.md.