
Provenance Challenge



Second Provenance Challenge: VisTrails

Participating Team

Differences from First Challenge

We have changed our provenance representation to make it more general and better structured, but the data stored is roughly equivalent to that of our previous representation. The schemas and data are provided below. Recall that we store workflow evolution in a vistrail, which is a tree of actions where each node represents a (possibly partial) workflow. To allow easier integration with other systems, we have also materialized the individual workflow specifications for the three parts.

We split our original workflow into three individual workflows to better reflect the independence of the parts. In addition, because the AIR tools depend on a (.hdr, .img) pair of files, the workflows are slightly restructured so that module inputs and outputs are also paired using a FileSet module.

Provenance Data for Workflow Parts

The provenance data is split into three layers (workflow evolution, workflows, and execution). The schemas for these layers are available:

The data corresponding to these layers:

Note that teams may decide to use the vistrail data or the four materialized workflows for the challenge; the four workflows constitute a subset of the workflows contained in the vistrail. Please refer to the previous challenge for documentation on the system design.

Model Integration Results

We have successfully performed most queries using data from VisTrails, MyGrid, and Southampton. We have included our own system because our new query API is general and not native to VisTrails.

Model comparison

The VisTrails and MyGrid models were easy to use because of their simple data formats. The generalized model of Southampton presented a greater challenge because of its many levels of nesting and abstraction. VisTrails required both the execution log and the workflow definition for the provenance queries, whereas MyGrid and Southampton needed only the execution log. Finally, VisTrails supports a third level of provenance, the workflow evolution layer; while we have not used it for this API, it has many benefits when asking queries about differences between workflows.


The answers obtained varied depending on which information was available. For example, using the VisTrails format, it was not possible to obtain intermediate data items because they are not recorded. In this case the closest answer was the module executions. The queries required the data to contain at least module executions, the connections between them, and the required annotations. These were all present in the models, except for a few missing annotations in Southampton and MyGrid.

VisTrails uses a normalized data model and therefore needs both the execution log and the workflow definition. MyGrid's execution log can be used without the workflow definition and contains derivation relationships between data items, which means the data contains redundant information. Southampton models some security features that may be useful but make the data larger and more complex.
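
To illustrate why a normalized model needs both files, the following minimal sketch joins a workflow definition with an execution log to recover which execution fed which. The record layout, field names, and identifiers are all assumptions made for illustration; they are not the actual VisTrails schema.

```python
# Hypothetical, simplified stand-ins for the two normalized sources.
# Workflow definition: connections between module ids (source -> destination).
connections = [(1, 2), (2, 3)]

# Execution log: one record per module execution, pointing back at a module id.
exec_log = [
    {"exec_id": "e1", "module_id": 1},
    {"exec_id": "e2", "module_id": 2},
    {"exec_id": "e3", "module_id": 3},
]


def execution_links(connections, exec_log):
    """Join the log with the workflow connections to get execution-level links."""
    by_module = {}
    for record in exec_log:
        by_module.setdefault(record["module_id"], []).append(record["exec_id"])

    links = []
    for src_module, dst_module in connections:
        for src_exec in by_module.get(src_module, []):
            for dst_exec in by_module.get(dst_module, []):
                links.append((src_exec, dst_exec))
    return links


print(execution_links(connections, exec_log))
```

A log that already stores derivation links, as described for MyGrid, would skip this join at the cost of repeating information that the workflow definition also holds.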


The concept of a data item varies between systems. It can be represented as the data exchanged between modules, the inputs or outputs of a workflow, or a file reference passed between modules. The concept of parameters, which are used in VisTrails to modify modules, does not exist in the other models. MyGrid uses something similar to edit the parameters of modules (such as setting the name of the file to save to), but this concept is not clearly defined. Southampton has the concept of an assertion, where every module/service records its own view of the process. This concept does not exist in the other systems and is not used in our provenance queries, but it might be important for validating results.

Other concepts, such as modules, connections, and executions, are the same across the systems, although most of them go by different names.


Our method uses wrappers to translate queries between a common data model and the source data. We first defined a high-level general model that captures the basic concepts of workflows and their executions, making it possible to express queries over the different models. Second, we defined API functions for the wrappers that use this model. Finally, we implemented the wrappers and constructed the queries.
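
The wrapper approach can be sketched roughly as follows. This is only an illustration of the pattern: the class names, method names, and identifiers are hypothetical and do not reproduce the actual API functions, and the in-memory dictionary stands in for a real provenance source.

```python
from abc import ABC, abstractmethod


class ProvenanceWrapper(ABC):
    """Hypothetical common interface; each source system gets its own subclass."""

    @abstractmethod
    def executions(self):
        """Return the identifiers of all module executions in the source."""

    @abstractmethod
    def predecessors(self, execution_id):
        """Return the executions whose output feeds the given execution."""


class DictWrapper(ProvenanceWrapper):
    """Toy wrapper over an in-memory dict, standing in for a real source."""

    def __init__(self, edges):
        # edges maps an execution id to the list of executions it depends on.
        self._edges = edges

    def executions(self):
        return sorted(self._edges)

    def predecessors(self, execution_id):
        return list(self._edges.get(execution_id, []))


def direct_inputs(wrapper, execution_id):
    """A query written once, against the common interface only."""
    return wrapper.predecessors(execution_id)


source = DictWrapper({"m3": ["m2"], "m2": ["m1"], "m1": []})
print(direct_inputs(source, "m3"))
```

The point of the design is that `direct_inputs` never touches the source format, so supporting another system means writing one more subclass rather than rewriting the queries.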

This challenge sought to address how provenance from different systems can be connected. However, there was no requirement for data products to be consistently identified. Thus, in order to connect provenance across different systems, we had to manually identify the mapping between output data from one workflow and input data for the next. This naming is an important consideration when coordinating workflows across different systems. One solution is to use more general identifiers such as LSIDs or some other standard identifier.
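
A manual mapping of this kind might look like the sketch below. The store names and data identifiers are invented for illustration and do not come from the challenge data; the only idea shown is a hand-maintained table keyed by (store, identifier) pairs.

```python
# Hypothetical hand-written table: an output in one store is the same data
# product as an input in the next store. All names here are made up.
OUTPUT_TO_INPUT = {
    ("store1", "output-a"): ("store2", "input-a"),
    ("store2", "output-b"): ("store3", "input-b"),
}

# Upstream queries need the reverse direction: from an input to its producer.
INPUT_TO_OUTPUT = {dst: src for src, dst in OUTPUT_TO_INPUT.items()}


def producer_of(store, data_id):
    """Return the (store, id) of the upstream output matching this input, or None."""
    return INPUT_TO_OUTPUT.get((store, data_id))


print(producer_of("store3", "input-b"))
```

With shared global identifiers, this table would be unnecessary, because the same key would appear in both stores.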

Translation Details


Scientific Workflow Provenance Data Model (SWPDM)

The SWPDM (shown above) is a general provenance model that aims to capture entities and relationships that are relevant to both the definition and execution of workflows. The goal is to define a general model that is able to represent provenance information obtained by different workflow systems.


Our model is instantiated as a query API that operates on the concepts in the model. Vertices are modeled as objects and edges as operations on these objects. There also exist more complex operations that traverse more than one edge; these model common provenance query operations.
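
As a rough illustration of "vertices as objects, edges as operations", the sketch below defines a hypothetical `Execution` class with a single-edge operation (`inputs`) and a multi-edge operation (`upstream`). The class is an assumption made for this example and is not the actual API implementation; the execution identifiers are borrowed from the example trace on this page.

```python
class Execution:
    """Hypothetical vertex object for a module execution."""

    def __init__(self, name):
        self.name = name
        self._inputs = []  # executions that produced this execution's inputs

    def add_input(self, other):
        self._inputs.append(other)

    def inputs(self):
        """Single-edge operation: the direct predecessors."""
        return list(self._inputs)

    def upstream(self):
        """Multi-edge operation: every (producer, consumer) pair reachable backwards."""
        pairs, seen, stack = [], {self.name}, [self]
        while stack:
            node = stack.pop()
            for pred in node.inputs():
                pairs.append((pred.name, node.name))
                if pred.name not in seen:  # visit each execution only once
                    seen.add(pred.name)
                    stack.append(pred)
        return pairs


e0, e1, e4, e7 = (Execution(n) for n in ("vt3:0", "vt3:1", "vt3:4", "vt3:7"))
e1.add_input(e0)
e4.add_input(e1)
e7.add_input(e4)

for producer, consumer in e7.upstream():
    print(producer, "-->", consumer)
```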


This API is implemented as wrappers on top of the different data models. The wrapper functions translate each query into a native query on the source. Currently VisTrails and Southampton use XML with XPath as the access method, so their queries are translated into XPath expressions. MyGrid uses RDF/XML on a SPARQL server, so its queries are translated into SPARQL expressions.
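
The XPath case can be sketched with Python's standard `xml.etree.ElementTree` module, which supports a subset of XPath. The XML layout below (element and attribute names) is invented for illustration and is not the actual VisTrails or Southampton log format.

```python
import xml.etree.ElementTree as ET

# Hypothetical log document; the real schemas differ.
LOG_XML = """
<log>
  <exec id="0" module="A"/>
  <exec id="1" module="B"/>
  <annotation execId="1" key="outputName" value="atlas-x.gif"/>
</log>
"""


def annotated_exec_ids(root, key, value):
    """Translate an 'annotated with key == value' query into an XPath expression.

    Note: this naive string formatting assumes key and value contain no quotes.
    """
    xpath = f".//annotation[@key='{key}'][@value='{value}']"
    return [node.get("execId") for node in root.findall(xpath)]


root = ET.fromstring(LOG_XML.strip())
print(annotated_exec_ids(root, "outputName", "atlas-x.gif"))
```

A SPARQL wrapper would follow the same shape, building a query string from the same arguments and sending it to the server instead of evaluating it locally.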

Using a combination of data sources (MyGrid->Southampton->Vistrails) we can now query the data using the API:

  r2 = pqf.getAllAnnotated(pModuleInstance,[('outputName', 'eq', 'atlas-x.gif')])
  prov = r2[0].getExecutionFromInstance()[0].upstream()

We then get the result:

  vt3:4 --> vt3:7
  vt3:1 --> vt3:4
  vt3:0 --> vt3:1
  pas2: --> vt3:0 --> pas2: --> pas2: --> pas2: --> pas2: --> --> --> -->

This is the execution provenance trace of the file atlas-x.gif.


The benchmark uses Query 1 (upstream of AtlasXGraphic), a good general upstream query that returns the upstream module executions. The data files are too small for a meaningful benchmark, but we have timed the query on each of the systems.


opn = ''
rl = pqf.getNode(pOutputPort, opn, store3.ns).getDataFromOutPort()[0].getExecutionFromOutData()[0].upstream()

1 sec


ar = [('outputName', 'eq', 'atlas-x.gif')]
r1 = pqf.getAllAnnotated(pModule,ar)[0].upstream()

0.1 sec


odn = 'challenge/atlas-x.gif'
rl = pqf.getNode(pDataItem, odn, store3.ns).getExecutionFromOutData().upstream()

1 sec

Benchmark results

Although these times are very short, two main factors seem to influence the result: the query engine used and the size of the data. VisTrails is fastest, using an XPath processor and a small amount of data. The MyGrid data file is small, but it uses a SPARQL server, which is slower than XPath. Southampton uses XPath but has large data files. These results include initialization of the wrapper and, for Southampton, some extra pre-processing to calculate the data links, but these have biased the result by at most a factor of 2.
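
Because the reported times include wrapper initialization, one way to separate the two costs is to time them independently. The harness below is a generic sketch, not the procedure used for the numbers above; `make_wrapper` and `run_query` are hypothetical callables supplied by the caller.

```python
import time


def time_query(make_wrapper, run_query, repeats=5):
    """Time wrapper setup and the query separately; return the best of several runs."""
    init_times, query_times = [], []
    for _ in range(repeats):
        t0 = time.perf_counter()
        wrapper = make_wrapper()
        t1 = time.perf_counter()
        run_query(wrapper)
        t2 = time.perf_counter()
        init_times.append(t1 - t0)
        query_times.append(t2 - t1)
    return min(init_times), min(query_times)


# Stand-in workload so the sketch runs on its own.
init_s, query_s = time_query(lambda: list(range(1000)), lambda w: sum(w))
print(f"init {init_s:.6f} s, query {query_s:.6f} s")
```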

Further Comments



In the general case, tracking provenance through different systems is a data integration problem. But by defining a common model (SWPDM) on a restricted domain (scientific workflows), the difficulty is reduced to efficiency and entity resolution problems. We believe that it should be possible for the scientific workflow community to support a model similar to the SWPDM to enable provenance to be tracked through their systems. We have shown that an API for querying this model can be built and that it is compatible with three of the current systems.

Problems for discussion:

How to connect these systems? The data needs to support references to other models, e.g. when a data item is stored externally and tracked through another provenance store. Common identifiers such as LSIDs might be part of the solution. External data items should also be given a namespace to indicate where they came from.

Is there a way to come up with common concepts for data items? They are used in many layers and have different meanings.

How can a user easily express these kinds of queries?

Query complexity: relational algebra cannot express these kinds of provenance queries because they require transitive closure.
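
To make the last point concrete: a fixed relational expression can only follow paths of a fixed length, so an upstream query over a chain of unknown depth needs iteration (or recursion) until no new pairs appear. The sketch below computes the closure that way over a plain edge relation. The function is an illustration only, and the edge identifiers are borrowed from the example trace on this page.

```python
def transitive_closure(edges):
    """Repeat a self-join on the edge relation until a fixpoint is reached."""
    closure = set(edges)
    while True:
        # One join step: (a, b) joined with (b, c) yields (a, c).
        new_pairs = {
            (a, c)
            for (a, b) in closure
            for (b2, c) in closure
            if b == b2
        }
        if new_pairs <= closure:  # nothing new, so the closure is complete
            return closure
        closure |= new_pairs


edges = {("vt3:0", "vt3:1"), ("vt3:1", "vt3:4"), ("vt3:4", "vt3:7")}
closure = transitive_closure(edges)
upstream_of_vt3_7 = sorted(a for (a, b) in closure if b == "vt3:7")
print(upstream_of_vt3_7)
```

The number of loop iterations depends on the data rather than on the query text, which is exactly what plain relational algebra cannot capture.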

-- TommyEllkvist - 21 Jun 2007


Attachment                     Size     Date                 Who            Comment
pc_vt.xml                      76.6 K   23 Feb 2007 - 00:20  JulianaFreire
pc_part1.xml                   12.8 K   23 Feb 2007 - 01:05  JulianaFreire
pc_part2.xml                   4.0 K    23 Feb 2007 - 01:05  JulianaFreire
pc_part3a.xml                  5.1 K    23 Feb 2007 - 01:05  JulianaFreire
pc_part3b.xml                  5.7 K    23 Feb 2007 - 01:06  JulianaFreire
pc_log.xml                     11.3 K   23 Feb 2007 - 00:22  JulianaFreire
vistrail.xsd                   6.4 K    23 Feb 2007 - 00:23  JulianaFreire
workflow.xsd                   3.5 K    23 Feb 2007 - 00:24  JulianaFreire
log.xsd                        2.7 K    23 Feb 2007 - 00:24  JulianaFreire
model.png                      28.3 K   21 Jun 2007 - 08:50  JulianaFreire
model_mygrid.png               22.7 K   21 Jun 2007 - 08:56  JulianaFreire
model_southampton.png          13.2 K   21 Jun 2007 - 08:56  JulianaFreire
model_vistrails.png            22.5 K   21 Jun 2007 - 08:57  JulianaFreire
vt_prov_challenge_present.ppt  711.0 K  02 Jul 2007 - 16:03  JulianaFreire  VisTrails Second Provenance Challenge Presentation
(file name not shown)          21.0 K   14 Aug 2008 - 16:39  JulianaFreire  API source files


Copyright © 1999-2012 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.