Dissertation: Design and Optimization of Scientific Workflows

[PDF - electronic version]

Abstract

This work considers the problem of design and optimization of scientific workflows. Progress in the natural sciences increasingly depends on effective and efficient means to manage and analyze large amounts of data. Scientific workflows form a crucial piece of cyberinfrastructure: they allow scientists to combine existing components (e.g., for data integration, analysis, and visualization) into larger software systems with which to conduct this new form of scientific discovery.

We propose VDAL (Virtual Data Assembly Lines), a dataflow-oriented paradigm for scientific workflows. In the VDAL approach, data is organized into nested collections, much like XML, and flows between components during workflow execution. Components are configured with XQuery/XPath-like expressions that specify their interaction with the data. We show how this approach addresses many challenges that are common in scientific workflow design, thus leading to better overall designs. We then study different ways to optimize VDAL execution. First, we show how to leverage parallel computing infrastructure by exploiting the pipeline, task, and data parallelism exhibited by the VDAL paradigm itself. To this end, we compile VDAL workflows into several MapReduce tasks that are executed in parallel. We then show how the cost of data shipping can be reduced in a distributed streaming implementation. Next, we propose a formal model for VDAL and show how static analysis can provide additional design features to support the scientist during workflow creation and maintenance, namely, by displaying actor dependencies, previewing the structure of the results, and explaining how output data will be generated from input data. Consequently, certain design errors can be detected prior to the actual workflow execution. Finally, we investigate the fundamental question of how to decide the equivalence of VDAL workflows. We show that, under an ordered data model, deciding workflow equivalence requires testing the equivalence of string-polynomials, a new problem. Here, our preliminary work defines several normal forms for approximating the equivalence of string-polynomials.

Brief Contributions

In Chapter 1, we motivate and introduce scientific workflows and provide necessary background information. Chapters 2-5 describe Virtual Data Assembly Lines (VDAL) and present optimization strategies for their execution; Chapters 6 and 7 present the more theoretical part of this work, where we consider static analysis of VDAL workflows.

In particular, Chapter 2 presents the VDAL paradigm. We outline concrete design challenges and show, based on example scenarios, how VDAL workflows address these challenges and are thus easier to design, maintain, and evolve than workflows based on plain dataflow network primitives. In Chapters 3 and 4, we describe approaches to improve execution efficiency: Chapter 3 analyzes how the execution of VDAL workflows can be accelerated by exploiting data parallelism. Here, we show how a MapReduce framework can be used to execute VDAL workflows in a cluster environment. In Chapter 4, we present a type system for workflows operating on ordered collections, and show how we can optimize data shipping by analyzing data dependencies based on information provided by the actors. Chapter 5 describes the workflow execution engine developed for this dissertation. Chapter 6 then describes how VDAL workflows can be translated into XML-processing programs written in the XML update language FLUX. We further define concepts related to workflow well-formedness and show how existing type systems for XML languages can be used to answer important design questions. In Chapter 7, we extend existing typing approaches for XML languages by developing a sound and complete type system for a core language for XML processing. We further introduce the problem of string-polynomial equivalence (SPE) and show that it is at the core of deciding query equivalence for XQuery with an ordered data model. Our current results towards solving this problem conclude this chapter. Chapter 8 summarizes this work and outlines future research opportunities.

This dissertation is based on the following publications: Chapter 1: [Zin08, MBZL09]; Chapter 2: [ZBML09a]; Chapter 3: [ZBKL09]; Chapter 4: [ZBML09b]; Chapter 5: [ZLL09]; and Chapter 6: [ZBL10].

Detailed Contributions

Development of the VDAL paradigm (Chapter 2)

We first identify several design challenges (illustrated by examples) common in scientific workflow design. These challenges result from parameter-rich functions; data assembly/disassembly and data cohesion; conditional execution; iteration; and, more generally, workflow evolution. To address these, we propose VDAL (Virtual Data Assembly Lines), a dataflow-oriented approach for building scientific workflows. VDAL was inspired by, and is a formal variant of, COMAD by McPhillips et al. [MB05, MBL06]. The key ideas of the VDAL approach are (1) data is organized in labeled nested collections flowing along a linear pipeline of workflow actors, and (2) workflow actors are wrapped inside a configurable shell that defines how the components interact with the data flowing between them. Using nested collections as a built-in data model removes the need for ad-hoc record or array structures, and thus completely removes the low-level shims (which were used to manage these structures) from the workflow. Deploying declarative configurations has the advantage that data-management tasks are controlled by the workflow developer at a much higher level of abstraction than was the case when explicit shim actors were used. Another crucial advantage is that in a VDAL workflow, actors are no longer tightly coupled via explicit wiring; instead, the labeled collection structure provides a level of indirection that makes workflows not only free of data-management shims, but also resilient to changes to the input data and to the workflow itself. As a core contribution, we describe the anatomy of VDAL components and their configurations in detail. We define how the data-management shell selects input data from the input stream, how it then iteratively invokes the wrapped scientific component with the selected data, and how it places the result data back into the nested collection structure. We then show how this approach addresses the design challenges we identified. This work has been published as [ZBML09a].
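
For intuition only, the following minimal sketch (in Python, with hypothetical names; it is not the dissertation's implementation) illustrates the two key ideas: data as a labeled nested collection, and an actor shell that is configured with a path-like scope, invokes the wrapped component on every match, and places the result back into the collection while all other data flows through untouched.

```python
# Hypothetical illustration of the VDAL idea (not the dissertation's implementation):
# data is a labeled nested collection, and an actor is a scientific function
# wrapped in a shell configured with a path-like scope and a result label.

def select(node, path):
    """Yield sub-collections of `node` whose labels match the given path."""
    if not path:
        yield node
        return
    head, *rest = path
    _label, children = node
    for child in children:
        if isinstance(child, tuple) and child[0] == head:
            yield from select(child, rest)

def run_actor(collection, scope, component, result_label):
    """Invoke `component` on every match of `scope`; place results back into the collection."""
    for match in select(collection, scope):
        result = component(match)                   # the scientific computation
        match[1].append((result_label, [result]))   # add the result next to its input
    return collection                               # unmatched data flows through untouched

# Example: sequences grouped per sample; compute GC content for every Seq.
data = ("Project", [
    ("Sample", [("Seq", ["ACGT"])]),
    ("Sample", [("Seq", ["TTGA"])]),
])
gc_content = lambda seq: sum(c in "GC" for c in seq[1][0]) / len(seq[1][0])
run_actor(data, ["Sample", "Seq"], gc_content, "GC")
```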

Exploiting data parallelism (Chapter 3)

We investigate possibilities for efficient execution of VDAL workflows. To this end, we first show how to exploit the data parallelism inherent in the VDAL paradigm. We develop strategies to transform a VDAL workflow into an equivalent series of MapReductions that can be executed in parallel on a cluster computer or on a cloud infrastructure. Our strategies differ in the complexity of the algorithms and data structures they use; for each of them, we discuss advantages and trade-offs. We further conduct a thorough evaluation using the Hadoop implementation of MapReduce. Our experiments confirm that our approach decreases execution time for compute-intensive pipelines (a speed-up factor of 20). While even our basic strategy achieves significant speed-ups (up to 17x) for workflows with relatively small collection structures, our most sophisticated strategy scales well even for very large collection structures and clearly outperforms the other strategies in these cases. These efficiency gains and other advantages of MapReduce (e.g., fault tolerance and ease of deployment) make this approach well suited for executing large-scale, compute-intensive VDAL workflows. I thank Sven Koehler for his great help with the Hadoop implementation. This work has been published as [ZBKL09].
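
To convey the basic correspondence, the sketch below (hypothetical names, plain Python rather than actual Hadoop code) shows how independent per-item invocations of an actor fit the MapReduce pattern: the map phase applies the component to each selected item, and the reduce phase regroups the results by their enclosing collection. The dissertation's actual compilation strategies additionally handle nested collection structure and ordering.

```python
# Hypothetical sketch of compiling one VDAL actor invocation into a MapReduce
# step (plain Python, not actual Hadoop code): the map phase processes selected
# items independently; the reduce phase regroups them by enclosing collection.

from collections import defaultdict

def map_phase(records, component):
    """records: (collection_id, position, item) triples; apply component per item."""
    for cid, pos, item in records:
        yield cid, (pos, component(item))

def reduce_phase(mapped):
    """Regroup processed items by collection id, restoring their original order."""
    groups = defaultdict(list)
    for cid, (pos, value) in mapped:
        groups[cid].append((pos, value))
    return {cid: [v for _, v in sorted(vals)] for cid, vals in groups.items()}

# Example: process two samples' sequences "in parallel".
records = [("sampleA", 0, "ACGT"), ("sampleA", 1, "TTGA"), ("sampleB", 0, "GGCC")]
result = reduce_phase(map_phase(records, component=str.lower))
# result == {"sampleA": ["acgt", "ttga"], "sampleB": ["ggcc"]}
```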

Minimizing data shipping (Chapter 4)

In a VDAL workflow, the actor configurations describe which part of the input data is used by the actor; all data that is not selected is simply ignored. Although this approach greatly helps during workflow design (see Chapter 2), it introduces unnecessary data shipping when implemented directly. Our contribution in Chapter 4 is to develop a type system and algorithms that address this drawback. In particular, we show how to compile a VDAL workflow into an equivalent (low-level) workflow with additional data-routing components; this low-level process network avoids unnecessary data shipping. Consequently, scientists can develop workflows using the VDAL abstractions without explicitly defining the data routing; instead, the workflow system itself optimizes data shipping when the workflow is deployed and executed. Our experimental evaluation confirms the effectiveness of our approach, with savings in data shipping of up to 50% and a decrease in execution time of more than 60%. This work has been published as [ZBML09b].
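
The following toy sketch (hypothetical, not the compiler of Chapter 4) illustrates the routing idea: given an actor's declared read scope, the stream is split so that only the tokens the actor actually reads are shipped through it, while everything else bypasses the actor; order restoration and grouping are omitted for brevity.

```python
# Toy illustration of the data-routing idea (not the compiler of Chapter 4):
# only tokens covered by the actor's declared read scope are shipped through
# the actor; everything else bypasses it and is merged back afterwards.

def run_with_bypass(tokens, actor, reads):
    through = [t for t in tokens if t["label"] in reads]
    bypass  = [t for t in tokens if t["label"] not in reads]
    processed = [actor(t) for t in through]   # only selected data reaches the actor
    return processed + bypass                 # merge streams; order restoration omitted

tokens = [{"label": "Seq",   "value": "ACGT"},
          {"label": "Image", "value": "<large binary blob>"}]   # never read by the actor
out = run_with_bypass(tokens,
                      actor=lambda t: {**t, "value": t["value"].lower()},
                      reads={"Seq"})
```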

Light-weight parallel workflow engine (Chapter 5)

In Chapter 5, we describe our implementation of a light-weight engine for executing process network (PN) workflows. This engine was used to perform the experiments presented in Chapter 4. Our first contribution here is to demonstrate how a small core library can provide a fast and scalable basis for PN and, ultimately, VDAL workflows. The second contribution is to show how such an external engine can be loosely coupled to the Kepler workflow system. In contrast to current approaches, in which data movement is explicitly defined in the workflow (e.g., via scp actors), the details of data movement are handled by the back-end workflow engine in our approach. Consequently, there is a clear separation between the scientific workflow logic and the details of its deployment at runtime. This not only makes workflow construction easier for the scientist, but also allows optimizations such as automatically choosing host machines as well as appropriate methods for data transfer. I thank my collaborator Xuan Li for implementing the Kepler side of the Kepler-PPN coupling. This work has been presented as [ZLL09].
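
As a rough illustration of what such a core looks like, the sketch below builds a tiny process network from threads connected by bounded queues; it is a generic example, not the PPN engine described in this chapter.

```python
# Generic process-network sketch (threads plus bounded queues); this is only
# meant to illustrate the kind of core a light-weight PN engine builds on,
# not the PPN engine described in Chapter 5.

import threading, queue

END = object()  # end-of-stream marker

def actor(fn, inbox, outbox):
    """Read tokens, apply fn, forward results; propagate end-of-stream."""
    while (token := inbox.get()) is not END:
        outbox.put(fn(token))
    outbox.put(END)

q1, q2, q3 = (queue.Queue(maxsize=8) for _ in range(3))
stages = [threading.Thread(target=actor, args=(fn, i, o))
          for fn, i, o in [(str.upper, q1, q2), (lambda s: s + "!", q2, q3)]]
for t in stages:
    t.start()
for token in ["a", "b"]:
    q1.put(token)
q1.put(END)

results = []
while (token := q3.get()) is not END:
    results.append(token)   # ["A!", "B!"]
for t in stages:
    t.join()
```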

Supporting Workflow Design (Chapter 6)

We designed the VDAL configurations to be expressive enough to support common data-manipulation tasks, yet still analyzable by the workflow system. In Chapter 6, we illustrate how static analysis can be used to support the scientist during workflow construction and maintenance. Given a configured VDAL workflow, the workflow system can, among other things, (1) predict the output schema of the workflow, (2) detect actors that will never invoke their inner component, e.g., due to errors in their configurations, and (3) display actor dependencies. Our contribution here is to show how to support these design features. To this end, we translate VDAL workflows into programs in the XML update language FLUX and extend the existing FLUX type system. This work has been published as [ZBL10].
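
The toy example below (hypothetical, and much simpler than the FLUX-based type system used in Chapter 6) conveys the flavor of such static analysis: a schema of nested labels is propagated through the configured actors to predict the output schema and to flag actors whose scope cannot match any data and which will therefore never fire.

```python
# Toy schema propagation (hypothetical; the actual Chapter 6 analysis uses the
# FLUX type system): predict the output schema and flag actors whose scope
# cannot match any input data and hence will never fire.

def propagate(schema, actors):
    """schema: {label: set of child labels}; actors: (name, scope_label, result_label) triples."""
    warnings = []
    for name, scope, result in actors:
        if scope not in schema:
            warnings.append(f"{name}: scope '{scope}' matches nothing; actor never fires")
            continue
        schema[scope] = schema[scope] | {result}   # the result is placed under the scope
        schema.setdefault(result, set())
    return schema, warnings

schema = {"Project": {"Sample"}, "Sample": {"Seq"}, "Seq": set()}
actors = [("ComputeGC", "Seq", "GC"), ("Render", "Plot", "PNG")]
out_schema, warnings = propagate(schema, actors)
# warnings == ["Render: scope 'Plot' matches nothing; actor never fires"]
```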

Deciding Workflow Equivalence (Chapter 7)

We further investigate the fundamental problem of deciding workflow equivalence. XQuery equivalence under an ordered data model has not been well studied by the database community. As a first step towards solving XQuery equivalence, and ultimately workflow equivalence, we develop a new approach for the static analysis of a for-let-return fragment of XQuery. Our approach is based on the new concept of possible-value types (pv-types). These structures exhibit similarities with conventional regular-expression types; however, rather than approximating query execution (as conventional types do), they capture the query semantics exactly. It is thus feasible to decide query equivalence based on the equivalence of the output pv-types of the respective queries. Our contributions here are to develop pv-types for XML processing and to define requirements for sound and complete type propagation. We then present a set of type-propagation rules and prove their soundness and completeness. Based on these notions, we show how deciding query equivalence can be reduced to deciding pv-type equivalence, a problem that turns out to be hard as well. We show that in order to solve pv-type equivalence for an ordered data model, it is necessary to decide the equivalence of string-polynomials (SPE). String-polynomials are "polynomials" over a mathematical structure that has not yet been well studied. Our initial work towards solving SPE provides several normal forms for string-polynomials that allow their equivalence to be approximated. We conjecture that SPE is decidable and plan to continue this line of research in future work.
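
For intuition only (this example does not reproduce the pv-type or string-polynomial machinery of Chapter 7), the following snippet illustrates one way in which order matters for equivalence: two query shapes that agree under bag semantics can still differ as sequences.

```python
# Intuition only (not the pv-type or string-polynomial machinery of Chapter 7):
# with an ordered data model, queries that agree as bags may differ as sequences.

from collections import Counter

xs = [1, 2, 3]
f = lambda x: ("f", x)
g = lambda x: ("g", x)

# "for x in xs return (f(x), g(x))": results are interleaved per element.
q1 = [y for x in xs for y in (f(x), g(x))]
# "(for x in xs return f(x), for x in xs return g(x))": results are concatenated.
q2 = [f(x) for x in xs] + [g(x) for x in xs]

assert Counter(q1) == Counter(q2)   # equivalent under bag (unordered) semantics
assert q1 != q2                     # not equivalent under ordered semantics
```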