Composing a scientific experiment with several different workflows
Scenario Authors: Scientific Workflow Group, COPPE, Federal University of Rio de Janeiro
Brief Summary:
In this scenario, pre-existing workflows were conceived independently, using different scientific workflow management systems (SWfMS). These independent workflows now need to be integrated into a single complex experiment, which entails additional manual activities that link them. How can two such workflows be related? How can the last activity of workflow 1 be linked to the first activity of workflow 2? Each SWfMS may manage provenance information in a decentralized and isolated way: each system records provenance at its own granularity and stores it in its own language, or, even worse, some SWfMS may not provide a provenance solution at all.
Scenario Diagram:
There are several scenarios of workflow execution in a distributed environment. Each one has its own characteristics that make provenance management difficult. According to Fig. 1, these scenarios can be classified into four types:
- remote execution of one or more workflow activities;
- remote execution of a sub-workflow;
- remote execution of a sub-workflow by another SWfMS;
- and execution of two or more workflows that are part of the same experiment in distinct SWfMS.
The first type is the simplest, since a SWfMS with good provenance management is able to gather provenance data even when activities are executed remotely. For the remaining types, however, a single SWfMS is not enough to manage all the provenance information, and data can be lost. For example, in the second and third scenarios, the SWfMS that executes the main workflow has no knowledge of the remote execution of the sub-workflow. It does not know which activities were executed, nor what data they consumed or generated. It only knows the input and output data of the sub-workflow as a whole. In the third scenario, the scientist can inspect this provenance information in the secondary SWfMS. Nevertheless, there is a high probability that it is represented differently from the provenance in the main SWfMS, which makes the analysis more complex for the scientist.
The fourth scenario is an extreme situation in which the workflow of an experiment is fragmented into several smaller workflows so that it can be executed in a heterogeneous environment with different SWfMS. In this case, each SWfMS manages provenance information in a decentralized way: each system records provenance at its own granularity and stores it in its own language, or, even worse, some SWfMS do not provide a provenance solution at all. In such situations, we can say that the experiment has heterogeneous provenance support.
Although extreme, this last scenario is becoming common in scientific experiments. The reason is that individual SWfMS have particular strengths, so adopting different SWfMS in different regions of the workflow can be more advantageous than adopting a single SWfMS for the workflow as a whole. For example, some workflow regions need to be executed in a SWfMS that supports result visualization, others in a SWfMS that provides grid support, and so on. Organizational issues may be an additional reason, for instance when the workflow must be executed across several laboratories of a virtual institute.
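As a rough illustration of heterogeneous provenance support, the following Python sketch maps the logs of two systems onto one common record. Everything in it is an assumption made for illustration: the systems "A" and "B", their log field names (`wf`, `task`, `in`, `out`, `workflow_name`, `consumed`, `produced`), and the helper `link_by_data` are hypothetical and do not describe any real SWfMS. System "A" logs each activity, while system "B" only reports the inputs and outputs of a whole sub-workflow, as in the second and third scenarios.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ActivityRecord:
    """Common, system-independent view of one executed activity."""
    system: str
    workflow: str
    activity: str
    inputs: tuple
    outputs: tuple


def from_system_a(row: dict) -> ActivityRecord:
    # Hypothetical SWfMS "A": fine-grained log, one row per activity.
    return ActivityRecord("A", row["wf"], row["task"],
                          tuple(row["in"]), tuple(row["out"]))


def from_system_b(entry: dict) -> ActivityRecord:
    # Hypothetical SWfMS "B": only the inputs and outputs of the whole
    # sub-workflow are known, so it becomes one coarse-grained activity.
    name = entry["workflow_name"]
    return ActivityRecord("B", name, name + " (whole sub-workflow)",
                          tuple(entry["consumed"]), tuple(entry["produced"]))


def link_by_data(records):
    """Return (producer, consumer, data item) triples for activities in
    different workflows where one produced a data item the other consumed."""
    links = []
    for producer in records:
        for consumer in records:
            if producer.workflow == consumer.workflow:
                continue
            shared = sorted(set(producer.outputs) & set(consumer.inputs))
            for item in shared:
                links.append((producer.activity, consumer.activity, item))
    return links


records = [
    from_system_a({"wf": "workflow1", "task": "align",
                   "in": ["reads.fasta"], "out": ["aligned.fasta"]}),
    from_system_b({"workflow_name": "workflow2",
                   "consumed": ["aligned.fasta"], "produced": ["plot.png"]}),
]
links = link_by_data(records)
```

Matching on shared data items is only one possible way to relate the records. It fails when a manual activity renames or converts the data between the two workflows, which is why explicit links between activities may also be needed.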
Users:
Scientists in domains such as bioinformatics and the oil industry
Requirement for provenance:
Since the different workflows are handled by individual SWfMS, isolated from each other, it can be impossible to trace a result back to the first activity of the original workflow.
Provenance Questions:
What were all the activities of the whole experiment? What was the activity immediately before activity X, where activity X is the first activity of workflow #2 and the user wants to know the last activity of workflow #1 or a manual activity?
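The two questions above can be sketched as queries over an experiment-level registry that gives every activity a global identifier and records explicit links between activities, including manual ones. This is a minimal Python sketch under stated assumptions, not the scenario authors' solution: the class `ExperimentProvenance`, its methods, the identifier scheme `workflow/name`, and the sample activity names are all hypothetical.

```python
from collections import defaultdict


class ExperimentProvenance:
    """Hypothetical registry spanning all workflows of one experiment."""

    def __init__(self):
        self.activities = {}                   # global id -> description
        self.predecessors = defaultdict(list)  # global id -> earlier ids

    def add_activity(self, workflow, name, kind="automatic"):
        # Assumed global identity scheme: "<workflow>/<activity name>".
        gid = f"{workflow}/{name}"
        self.activities[gid] = {"workflow": workflow, "name": name, "kind": kind}
        return gid

    def add_link(self, before, after):
        # Record that `before` ran immediately before `after`.
        for gid in (before, after):
            if gid not in self.activities:
                raise KeyError(f"unknown activity: {gid}")
        self.predecessors[after].append(before)

    def all_activities(self):
        # Question 1: all the activities of the whole experiment.
        return sorted(self.activities)

    def previous(self, gid):
        # Question 2: the activities immediately before a given activity.
        return list(self.predecessors.get(gid, []))

    def trace_back(self, gid):
        # Follow links backwards, across workflow boundaries.
        seen, stack, order = set(), [gid], []
        while stack:
            current = stack.pop()
            for prev in self.predecessors.get(current, []):
                if prev not in seen:
                    seen.add(prev)
                    order.append(prev)
                    stack.append(prev)
        return order


prov = ExperimentProvenance()
a1 = prov.add_activity("workflow1", "filter")
a2 = prov.add_activity("workflow1", "align")
m = prov.add_activity("manual", "convert_format", kind="manual")
x = prov.add_activity("workflow2", "visualize")
prov.add_link(a1, a2)
prov.add_link(a2, m)   # last activity of workflow 1 -> manual step
prov.add_link(m, x)    # manual step -> first activity of workflow 2

print(prov.previous(x))    # ['manual/convert_format']
print(prov.trace_back(x))  # back to 'workflow1/filter'
```

In this sketch the manual activity is registered like any other activity, so the link between the last activity of workflow #1 and the first activity of workflow #2 survives even though neither SWfMS recorded it.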
Technologies Used:
databases, web browser, scientific workflow, object identity and global relationships
Background and description: