October 4, 2018
Distributed, Incremental Dataflow Processing in the Cloud with Reflow
Many data processing tasks involve piecing together many different software packages in increasingly complex configurations. Called workflows or pipelines, these data processing tasks are often managed by workflow management systems that require users to construct explicit dependency graphs between many components; these management systems in turn rely on some underlying cluster computing fabric to distribute load.
In this talk, we’ll present Reflow, a domain specific language and runtime system designed for such data processing tasks. Reflow treats “workflow programming” like any other programming tasks: instead of declaring explicit dependency graphs, Reflow users use functional programming to directly specify their data pipelines.
Reflow presents a vertically integrated environment for data processing, comprising: a strong data model based on dataflow semantics; built-in cluster management and job scheduling; fully incremental computing; and a module system that promotes modern software engineering practices like testing and modularity.
We’ll talk about how Reflow’s approach allowed us to implement a powerful data processing system with comparatively little code and few dependencies. Particularly we’ll discuss how the co-design of Reflow’s language, runtime, and cluster management systems resulted in a simple and comprehensible systems design. We’ll then discuss how Reflow supports our large-scale data processing environment.
Marius Eriksen is a principal software engineer at GRAIL, working primarily on large-scale data processing infrastructure. Prior to GRAIL, Marius was a principal engineer at Twitter, where he was the primary author of Finagle, a library similar in spirit to our own Async library that is at the heart of many of Twitter's services. He's a long-time devotee of functional programming, having written a geospatial database in Haskell for a company called Mixer Labs which was subsequently acquired by Twitter, and since then having worked extensively in Scala, a mixed functional/oo language based on the JVM. More recently, his work on Strato (Twitter) and Reflow (GRAIL) has brought ideas in functional and incremental programming to new domains. Marius is also an old-time OpenBSD kernel hacker, among other things having ported VMWare kernel support to OpenBSD.