REEF continues the trend to re-layer the big data stack.

What is REEF?

REEF stands for the Retainable Evaluator Execution Framework, and it is our approach to simplify and unify the lower layers of big data systems on modern resource managers like Apache YARN, Apache Mesos, Google Omega, and Facebook Corona. On these resource managers, REEF provides a centralized control plane abstraction that can be used to build a decentralized data plane for supporting big data systems, like those mentioned below. Special consideration is given to graph computation and machine learning applications, which require data retention on allocated resources, as they execute multiple passes over the data.

More broadly, applications that run on YARN will have the need for a variety of data-processing tasks e.g., data shuffle, group communication, aggregation, checkpointing, and many more. Rather than reimplement these for each application, REEF aims to provide them in a library form, so that they can be reused by higher-level applications and tuned for a specific domain problem e.g., Machine Learning.

In that sense, our long-term vision is that REEF will mature into a Big Data Application Server, that will host a variety of tool kits and applications, on modern resource managers.

Who is it for?

REEF is for developers of data processing systems on cloud computing platforms that provide fine-grained resource allocations. REEF provides system authors with a centralized (pluggable) control flow that embeds a user-defined system controller called the Job Driver. The interfaces associated with the Job Driver are event driven; events signal resource allocations and failures, various states associated with task executions and communication channels, alarms based on wall-clock or logical time,  and so on. REEF also aims to package a variety of data-processing libraries (e.g., high-bandwidth shuffle, relational operators, low-latency group communication, etc.) in a reusable form. Authors of big data systems and toolkits can leverage REEF to immediately begin development of their application specific data flow,  while reusing packaged libraries where they make sense.

Why did it come about?

Traditional data-processing systems are built around a single programming model (like SQL or MapReduce) and a runtime (query) engine. These systems assume full ownership over the machine resources used to execute compiled queries. For example, Hadoop (version one) supports the MapReduce programming model, which is used to express a jobs that execute a map step followed by an optional reduce step. Each step is carried out by some number of parallel tasks. The Hadoop runtime is built on a single master (the JobTracker) that schedules map and reduce tasks on a set of workers (TaskTrackers) that expose fixed-sized task “slots”. This design leads to three key problems in Hadoop:

  1. The resources tied to a TaskTracker are provisioned for MapReduce only.
  2. Clients must speak some form of MapReduce in order to make use of cluster resources, and in turn, gain compute access to the data that lives there.
  3. Poor cluster utilization, especially in the case of idle resources (slots) due to straggler tasks.

With YARN (Hadoop version two), resource management has been decoupled from the MapReduce programming model in Hadoop, freeing cluster resources from slotted formats, and opening the door to programming frameworks beyond MapReduce. It is well understood that while enticingly simple and fault-tolerant, the MapReduce model is not ideal for many applications, especially iterative or recursive workloads like machine learning and graph processing, and those that tremendously benefit from main memory (as opposed to disk based) computation. A variety of big data systems stem from this insight: Microsoft’s Dryad, Apache Spark, Google’s Pregel, CMU’s GraphLab and UCI’s AsterixDB, to name a few. Each of these systems add unique capabilities, but form islands around key functionalities, making it hard to share both data and compute resources between them. YARN, and related resource managers, move us one step closer toward a unified Big Data system stack. The goal of REEF is to provide the next level of detail in this layering.