Watermill: A Streaming Workflow Engine
Watermill lets you orchestrate tasks using operators like join, junction, and fork. Each task goes through the following lifecycle:
- Input glob patterns are resolved to absolute file paths (e.g.
- The operation is run and is passed the resolved input, params, and other props.
- The operation completes.
- Output glob patterns are resolved to absolute file paths.
- Validators are run over the output. By default they check for non-null files; custom validators can also be passed in.
- Post-validations are run. The task and its output are added to the DAG.
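The lifecycle above can be sketched in plain Node.js. This is only an illustration under stated assumptions, not Watermill's implementation: `resolvePattern`, `nonEmpty`, and `runLifecycle` are hypothetical names, files live in an in-memory `Map` rather than on disk, only `*.ext` patterns are supported, and a plain array stands in for the DAG.

```javascript
const path = require('path')

// Hypothetical glob stand-in: supports only patterns of the form '*.ext'.
// `files` is an in-memory Map of file name -> contents.
function resolvePattern (pattern, files, cwd) {
  const suffix = pattern.slice(1) // drop the leading '*'
  return [...files.keys()]
    .filter((name) => name.endsWith(suffix))
    .map((name) => path.resolve(cwd, name))
}

// Default validator: at least one output file, and none of them empty.
function nonEmpty (output, files) {
  return output.length > 0 &&
    output.every((file) => Boolean(files.get(path.basename(file))))
}

function runLifecycle (task, files, cwd, dag) {
  // 1. Resolve input patterns to absolute paths.
  const input = resolvePattern(task.input, files, cwd)
  // 2-3. Run the operation with the resolved input and params
  //      (synchronous in this sketch, so it has completed when it returns).
  task.operation({ input, params: task.params, files })
  // 4. Resolve output patterns to absolute paths.
  const output = resolvePattern(task.output, files, cwd)
  // 5. Run the default validator plus any custom ones.
  const validators = [nonEmpty, ...(task.validators || [])]
  if (!validators.every((validate) => validate(output, files))) {
    throw new Error(`Output of "${task.name}" failed validation`)
  }
  // 6. Record the task and its output (an array stands in for the DAG).
  dag.push({ name: task.name, input, output })
  return output
}

// Usage: turn a "lowercase" file into an "uppercase" one.
const files = new Map([['reads.lowercase', 'acgt']])
const dag = []
const output = runLifecycle({
  name: 'uppercase sketch',
  input: '*.lowercase',
  output: '*.uppercase',
  params: {},
  operation: ({ input, files }) => {
    for (const file of input) {
      const name = path.basename(file)
      const target = name.replace('.lowercase', '.uppercase')
      files.set(target, files.get(name).toUpperCase())
    }
  }
}, files, '/data', dag)
```

If the operation produced no matching output, or an empty file, step 5 would throw and the task would never reach the DAG.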
What is a task?
A task is the fundamental unit from which pipelines are built. For more details, see Task. At a glance, a task is created by passing in props and an operationCreator, which is later called with the resolved input. Consider this task, which takes a "lowercase" file and creates an "uppercase" one:
const uppercase =
A "task declaration" like the one above does not run the task immediately. Instead, it returns an "invocable task" that can either be called directly or used with an orchestration operator. Tasks can also be created to run shell programs:
const fastqDump =
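To illustrate the difference between declaring a task and invoking it, here is a minimal sketch. `declareTask` is a hypothetical stand-in written for this example, not Watermill's actual `task` function, and the input is passed in directly instead of being resolved from glob patterns.

```javascript
// Hypothetical stand-in for a task declaration; not Watermill's real API.
function declareTask (props, operationCreator) {
  // Nothing runs here: we only return a function that can be invoked later.
  return function invocableTask (resolvedInput) {
    const result = operationCreator({ ...props, input: resolvedInput })
    // Wrap in a promise so sync and async operations look the same to callers.
    return Promise.resolve(result)
  }
}

let calls = 0 // counts how often the operationCreator has actually run

const uppercaseSketch = declareTask(
  { name: 'uppercase sketch', params: { suffix: '!' } },
  ({ input, params }) => {
    calls += 1
    return input.toUpperCase() + params.suffix
  }
)

// At this point `calls` is still 0: declaring the task did not run it.
// Invoking it runs the operation with the given input.
const pending = uppercaseSketch('acgt')
pending.then((result) => console.log(result))
```

Because the declaration only returns a function, the same invocable task can be handed to an orchestrator and run at whatever point the pipeline dictates.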
What are orchestrators?
Orchestrators are functions that take tasks as parameters, letting you compose your pipeline from a high-level view. This separates task order from task declaration. For more details, see Orchestration. At a glance, here is a complex usage of
const pipeline =
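As a rough mental model of how orchestrators separate task order from task declaration, the sketch below re-implements two of them over plain async functions. It assumes that join runs tasks one after another and that junction runs tasks in parallel and waits for all of them; both are assumptions made for this example rather than Watermill's actual code. fork is left out because its branching behaviour is not described here.

```javascript
// Assumed semantics: run tasks in series, feeding each result to the next.
const join = (...tasks) => (input) =>
  tasks.reduce((previous, task) => previous.then(task), Promise.resolve(input))

// Assumed semantics: run tasks in parallel on the same input, collect results.
const junction = (...tasks) => (input) =>
  Promise.all(tasks.map((task) => task(input)))

// Toy "tasks": plain async functions standing in for invocable tasks.
const trim = async (sequence) => sequence.trim()
const upper = async (sequence) => sequence.toUpperCase()
const length = async (sequence) => sequence.length

// Order is declared here, separately from the task definitions above.
const pipelineSketch = join(trim, junction(upper, length))

const run = pipelineSketch('  acgt ')
run.then((result) => console.log(result)) // [ 'ACGT', 4 ]
```

Reordering the pipeline only means changing the `join`/`junction` expression; the task definitions stay untouched.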
- Toy pipeline with shell/node
- Simple capitalize task
- Simple SNP calling
- SNP calling with filtering and fork
Who is this tool for?
Watermill is for biologists who understand that it is important to experiment with sample data, parameter values, and tools. Compared to other workflow systems, it makes swapping parameters and tools much easier, so you can iteratively compare results and draw more confident inferences. For example, you can construct your own Teaser for your data with a simple syntax while getting strong performance out of the box.