Provenance Challenge: UTEP Trust Lab team
Participating Team
Team and Project Details
- Short team name: UTEP
- Participant names: Paulo Pinheiro da Silva, Nicholas Del Rio, Leonardo Salayandia
- Project URL: http://trust.utep.edu
- Project Overview:
- Relevant Publications:
Workflow Representation
In our approach we use abstract workflows instead of executable workflows. Using the WDO-It! tool (http://trust.utep.edu/wdo/downloads), we start by creating an ontology of the concepts that will be used in the workflow. The ontology is referred to as a Workflow-Driven Ontology, and it mainly consists of two hierarchies of concepts (or classes): Data and Method. Data concepts represent some parameter, dataset, or user input in the workflow, and are illustrated as directed edges in the workflow graph. Method concepts represent functionality that takes Data as input and transforms it into some other Data output, and are illustrated as rectangles in the workflow graph.
The purpose of creating abstract workflows instead of executable workflows is to emphasize understandability of the process being represented. Hence, workflow creators are encouraged to use Data and Method concept names that are meaningful to them. For example, in the first workflow below, the workflow author judged that the Method concept name "CheckManifestFile" captured the intended meaning of the sequence of actions "IsCSVReadyFileExist" and "ReadCSVReadyFile" from the specification of the PC3 workflow.
Once the ontological concepts have been identified and captured in the ontology, the abstract workflow is constructed with the WDO-It! tool by creating "instances" of the ontology concepts and connecting Data and Method concepts to specify the intended workflow behavior. In addition, the abstract workflows created with the WDO-It! tool ground Data concepts to Sources (and Sinks), which are concepts reused from the provenance component of the Proof Markup Language (PML-P). With respect to PML, Sources and Sinks are equivalent, and we refer to both as Sources. Sources represent the entities where data is coming from (or where the data is eventually going to). For example, a Source can be a Database, a Document, or a Human user. Sources are represented as ovals in the workflow graph.
Finally, different levels of abstraction are also supported. The first workflow represents the most abstract workflow representation of the PC3 workflow. The second workflow, on the other hand, represents a lower level of abstraction of the "PopulateDB" method shown in the first workflow.
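The graph structure described above (Method concepts as nodes, Data concepts as directed edges, Sources as endpoints) can be sketched in a few lines of Python. This is only an illustration and not the WDO-It! data model: the class names, the "ManifestEntries" and "LoadedRows" Data concepts, and the wiring between methods are assumptions made for the example, and only "CheckManifestFile", "PopulateDB", "CSVRootPath", and "JobID" are names taken from the text.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Node:
    name: str
    kind: str  # "Method" (rectangle) or "Source" (oval)


@dataclass(frozen=True)
class DataEdge:
    concept: str  # name of the Data concept carried by this directed edge
    src: Node
    dst: Node


@dataclass
class AbstractWorkflow:
    edges: list = field(default_factory=list)

    def connect(self, concept, src, dst):
        """Add a Data edge flowing from src to dst."""
        edge = DataEdge(concept, src, dst)
        self.edges.append(edge)
        return edge

    def inputs_of(self, method):
        """Data concepts consumed by a Method node."""
        return [e.concept for e in self.edges if e.dst == method]

    def outputs_of(self, method):
        """Data concepts produced by a Method node."""
        return [e.concept for e in self.edges if e.src == method]


# Sources ground the Data concepts at the edges of the workflow.
user = Node("HumanUser", "Source")
database = Node("Database", "Source")
check = Node("CheckManifestFile", "Method")
populate = Node("PopulateDB", "Method")

wf = AbstractWorkflow()
wf.connect("CSVRootPath", user, check)
wf.connect("JobID", user, check)
wf.connect("ManifestEntries", check, populate)  # hypothetical Data concept
wf.connect("LoadedRows", populate, database)    # hypothetical Data concept
```

A lower level of abstraction, such as the detailed view of "PopulateDB", could be modeled under the same assumptions as a second AbstractWorkflow whose external inputs and outputs match the edges of the node it refines.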
- First workflow: Abstract workflow representation of the PC3 workflow
- Second workflow: More detailed abstract workflow of the PopulateDB method shown in the first workflow
Logging Provenance
One benefit of authoring abstract workflows using WDO-It! is the ability to generate “wrappers” and “data annotators,” which are modules designed to capture and encode provenance associated with an abstract workflow, during runtime and post-runtime respectively. The main distinction between the two logging methods has to do with when the provenance is logged, which ultimately has implications for how it is logged. Certain properties of the workflow dictate when one method should be used over the other. For example, when intermediate artifacts are not persisted during execution of the workflow, a wrapper approach must be used to capture them before they are lost, as is the case when running the Java version of the PC3 workflow. There, the intermediate results exist only as Java objects that are removed from memory at the end of execution, so a wrapper is necessary to capture these objects during runtime before they are destroyed. This implies, however, that the workflow must be instrumented to invoke the wrapper modules, which requires alterations to an otherwise tried and tested workflow.
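The runtime capture performed by a wrapper can be illustrated with a minimal Python sketch. It is not the code that WDO-It! generates: the decorator name, the in-memory log, and the stand-in method body are all assumptions, and a real wrapper would encode each record in PML rather than in a dictionary.

```python
import functools

# Hypothetical in-memory log; a real wrapper would emit PML documents instead.
provenance_log = []


def logs_provenance(method_name):
    """Wrap a workflow step so its inputs and output are recorded at runtime."""
    def decorate(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            result = func(*args, **kwargs)
            # Capture the intermediate artifact before it can be discarded.
            provenance_log.append({
                "method": method_name,
                "inputs": list(args) + list(kwargs.values()),
                "output": result,
            })
            return result
        return wrapper
    return decorate


@logs_provenance("CheckManifestFile")
def check_manifest_file(csv_root_path):
    # Stand-in body; the actual PC3 step is not reproduced here.
    return {"root": csv_root_path, "ready": True}


check_manifest_file("/data/batch1")
```

The sketch also shows the cost noted above: the workflow step itself has to be modified (here, decorated) so that the wrapper is invoked.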
If a workflow does not delete intermediate results, then the non-invasive “data annotation” method can be used. This module can “piece together” provenance by chaining the intermediate results based on their “wasDerivedFrom” relationship. When running the batch version of the PC3 workflow, provenance could be captured with a data annotator because the batch files do not clean up the intermediate XML files that get dumped.
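The chaining step of a data annotator can be sketched as a traversal over "wasDerivedFrom" links between persisted artifacts. The file names and the dictionary layout below are assumptions for illustration only; the abstract workflow is what would tell a real annotator which artifact derives from which.

```python
from collections import deque

# Hypothetical persisted artifacts: each maps to the artifacts it was derived from.
was_derived_from = {
    "step3.xml": ["step2.xml"],
    "step2.xml": ["step1.xml"],
    "step1.xml": [],
}


def derivation_chain(artifact, relation):
    """Return the artifact followed by its ancestors, nearest ancestor first."""
    chain = []
    seen = {artifact}
    queue = deque([artifact])
    while queue:
        current = queue.popleft()
        chain.append(current)
        for parent in relation.get(current, []):
            if parent not in seen:  # guard against revisiting shared ancestors
                seen.add(parent)
                queue.append(parent)
    return chain


chain = derivation_chain("step3.xml", was_derived_from)
```

Because the traversal only reads files that already exist after execution, nothing in the workflow itself has to change, which is what makes this method non-invasive.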
It is important to note that most of the information needed to generate a fully functional wrapper or data annotator is contained in the abstract workflow. All the relationships between data, methods, and PML sources in a particular workflow are captured in WDO-It!, and this knowledge is leveraged to generate a wrapper or data annotator that needs only minor tweaks to work.
For this challenge we opted to use the batch version of the PC3 workflow and employed a wrapper approach for logging provenance, even though we could have used a data annotator. Provenance for this workflow was encoded in the Proof Markup Language (PML), the default encoding language of both the wrappers and data annotators. Our PML-based provenance dump for the PC3 workflow can be found here. The start nodeset of the PML provenance graph can be found here.
Executing Wrappers
The wrappers generated from WDO-It! are not fully functional and need to be enhanced before they can be executed.
Visualizing Provenance
Probe-It! is a browser suited to graphically rendering Proof Markup Language based provenance associated with results derived from both inference engines and workflows. You can open Probe-It! already showing the PC3 PML provenance by clicking here.
Probe-It! consists of three primary views to accommodate the different kinds of provenance information: result view, global justification view, and local information view, which refer to final and intermediate data, descriptions of the generation process as a whole, and information about a specific step in the process, respectively. Below is a partial PML trace of the PC3 workflow as visualized in Probe-It! The orange boxes on the top left and top right correspond to the workflow inputs: the XML file encoding the CSVRootPath and the JobID, respectively. The arcs represent the "usedBy" relationship in OPM. However, this is not an OPM graph, and in PML terms the arcs actually represent the "hasAntecedents" relation.
- Probe-It! screenshot visualizing PC3 PML:
Open Provenance Model Output
Query Results
Suggested Workflow Variants
Suggested Queries
Suggestions for Modification of the Open Provenance Model
Conclusions