
Provenance Challenge



National Center for Supercomputing Applications

Participating Team

Differences from First Challenge

With respect to workflow enactment, we have not re-run the workflows for the second challenge. Instead our entry is based on the provenance traces we collected during the first challenge.

Provenance Data for Workflow Parts

In our entries to the first challenge, we did not collect execution traces in the form of files; we collected them in the form of RDF statements.

We implemented the challenge workflow using two different workflow engines (D2K and CyberIntegrator), each of which had been modified to generate execution traces in the form of RDF statements. No attempt was made to coordinate the vocabulary, ontology, or structure of those statements between the two implementations, so there are numerous differences, which are detailed in our first entry.

The complete traces are given below in RDF/XML format.

Here is a rough partition of the execution trace data into the three parts called out in the second challenge description. For D2K, the subsets were generated by finding all RDF statements within three predicates of an RDF subject representing a relevant D2K module (e.g., for part 2, the module responsible for executing Softmean). Note that some statements will be missing from the concatenation of these three parts, but none of the missing statements are relevant to the provenance queries.

Roughly the same approach was used for CyberIntegrator.
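The subset extraction described above can be sketched as a bounded breadth-first search over triples. This is only an illustration, not the code used for the entry: it assumes triples are plain (subject, predicate, object) tuples, that "within three predicates" means at most three hops in either direction, and the `ex:` identifiers are hypothetical.

```python
from collections import deque

def statements_within(triples, start, max_hops=3):
    """Return the triples reachable from `start` by following at most
    `max_hops` predicates, treating each triple as an undirected edge."""
    # Index triples by the resources they touch.
    touching = {}
    for s, p, o in triples:
        touching.setdefault(s, []).append((s, p, o))
        touching.setdefault(o, []).append((s, p, o))

    selected = set()
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_hops:
            continue  # do not expand beyond the hop limit
        for s, p, o in touching.get(node, []):
            selected.add((s, p, o))
            neighbor = o if s == node else s
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, depth + 1))
    return selected

# Hypothetical trace: a chain of five resources hanging off one module.
trace = [
    ("ex:softmeanModule", "ex:hasOutput", "ex:atlasImage"),
    ("ex:atlasImage", "ex:hasFormat", "ex:analyze"),
    ("ex:analyze", "ex:definedBy", "ex:spec"),
    ("ex:spec", "ex:publishedBy", "ex:someone"),
]
subset = statements_within(trace, "ex:softmeanModule", max_hops=3)
```

With the hop limit at three, the fourth statement in the chain falls outside the subset, which is the kind of omission the note about missing statements refers to.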

The annotation data is not included in these traces. That data is available in separate files that are attached to our first challenge entry and can be merged with the data above using freely-available RDF tooling such as Jena or Tupelo.

Partitioning the statements in these traces into separate files does not change the meaning of any of the RDF statements in them, so it may be more expedient to operate on the complete traces. Standard RDF tooling provides facilities for extracting graph subsets from RDF data for the purpose of transforming it into other data structures.

Because the D2K and CyberIntegrator teams did not specify a formal ontology or vocabulary for their execution traces, the traces are subject to interpretation. Our interpretation is described in our first challenge entry, and below we will attempt to develop a generalized formal interpretation that can be applied more broadly.

Model Integration Results

We integrated several teams' models; our results are summarized in this slide set (PDF, 1.5MB), and described in much greater detail below.

To aid in rapidly assessing model integration results, we developed this visualization. Use the combo boxes to select a team for each stage. The visualization shows the result of our model integration for the given combination of teams.


Conceptual overview

To perform model integration, we adopted a number of working assumptions, including the naive interpretation and open-world assumptions referred to throughout this report.

These assumptions are problematic with respect to the workflow traces we investigated, and the problems we encountered reflect both on the way those traces are designed and the usefulness of our working assumptions.

In our view, our working assumptions are an approximation of the at-scale model integration problem, since most information that would allow non-minimal assumptions to be made is not available in machine-readable form for consumption by unassisted software agents. For example, the ontologies provided by some of the teams are all disjoint, although they appear to contain many concepts that could be manually correlated.

The challenge description itself causes some problems with respect to how rigorous the model integration process can be, since the specification of the workflow and queries is not expressed in a formal language and therefore there is no rigorous way to prove that any given implementation is correct.

Datasets selected

We performed model integration over seven traces:

  1. CESNET
  2. ES3
  3. Karma
  4. MINDSWAP
  5. MyGrid
  6. SDG
  7. VisTrails

These traces can be classified as follows:

  1. SOA traces (CESNET, Karma, MINDSWAP)
  2. Other XML formats (ES3, VisTrails)
  3. RDF and/or OWL (MINDSWAP, MyGrid, SDG)

We did not attempt the other traces for a variety of reasons:


Given the naive interpretation and open-world assumptions described earlier in this report, our model integration strategy proceeded in four stages.

In stage 1, we attempted a minimal conversion, which entailed not only using as little challenge-workflow-specific knowledge as possible, but also making minimal effort to reconcile differences between workflow traces (e.g., identifiers of steps and datasets, granularity of description), leaving them largely unaltered.

In stage 2, we also added identity assertions where we could not find an explicit correspondence between nodes across the parts for a single system. In some cases this could have been corrected with more elaborate stage 1 processing, but our hypothesis was that adding the assertions in stage 2 would require less manual effort.

In stage 3, connectedness was examined manually in order to locate and debug erroneous identity assertions.

In stage 4, the queries were executed automatically. This required adding identity assertions for query parameters (e.g., atlas X graphic) and results.
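The effect of the identity assertions in stages 2 and 4 can be illustrated with a small union-find pass that rewrites equivalent identifiers to one representative before merging. This is a sketch under stated assumptions, not the Tupelo implementation: triples are plain tuples, the `teamA:`, `teamB:`, and `pc:` identifiers are hypothetical, and the smallest identifier in each equivalence class is arbitrarily chosen as its representative.

```python
def canonicalize(triples, same_as_pairs):
    """Rewrite subjects and objects so that identifiers asserted
    equivalent (owl:sameAs style) share a single name."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in same_as_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Keep the lexicographically smaller identifier as representative.
            if root_b < root_a:
                root_a, root_b = root_b, root_a
            parent[root_b] = root_a

    return {(find(s), p, find(o)) for s, p, o in triples}

# Two hypothetical workflow parts from different teams.
part1 = [("teamA:reslice1", "t:stepHasOutput", "teamA:file17")]
part2 = [("teamB:softmean", "t:stepHasInput", "teamB:img-1")]
identity = [
    ("teamA:file17", "pc:ReslicedImage1"),
    ("teamB:img-1", "pc:ReslicedImage1"),
]
merged = canonicalize(part1 + part2, identity)
```

After canonicalization both parts mention the same boundary Dataset, so the merged graph is connected across the part boundary. An erroneous assertion would show up as an unexpected join, which is what the manual inspection in stage 3 looks for.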


The model integration was implemented in large part using NCSA's Tupelo semantic content repository, which provides an API for managing and processing RDF information. However, none of the techniques we used depend on unique features or capabilities of Tupelo, so the strategy could have been implemented with standard RDF tooling instead, including APIs such as Jena or databases such as Mulgara.

Supporting technologies included Jena (for RDF/XML processing), XSLT transformations, and Graphviz.

Implementation stage 1

To formalize our naive interpretation assumption, we developed a minimal RDF vocabulary for describing steps, datasets, input/output relationships, and other aspects of the information required to answer the challenge queries. This vocabulary was not described as an OWL ontology or otherwise formally characterized. None of the minimal vocabulary is specific to the provenance challenge workflow. For model integration, the most important terms in the vocabulary are those for Steps, Datasets, and the input/output relationships between them (e.g., t:Step, t:stepHasInput, and t:stepHasOutput, which appear in the stage 4 queries).

In stage 1, RDF statements were extracted from workflow traces. This was accomplished in one of two ways depending on the serialization employed:

  1. RDF/XML serializations (MINDSWAP, MyGrid, SDG) were parsed using Jena and RDF statements using our minimal RDF vocabulary were generated using rule-based transformations specific to each dataset implemented using Tupelo's "Transformer" operator (which is analogous to a simple rule in a logic programming language).
  2. Other XML serializations (CESNET, ES3, Karma, VisTrails) were transformed from XML into N-Triples using XSLT stylesheets specific to each dataset.

Several of the workflow traces raised interesting issues during stage 1. These are detailed in the "Notes on the teams" section below.

Implementation stage 2

The challenge specification required the teams to split their workflow traces into three parts. Because of our naive interpretation and open-world assumptions, we assumed that the files used to serialize the workflow traces for a given team each described the part of the workflow they were supposed to, and that they described parts rather than complete workflows. Accordingly, we assumed that there is a single item in each workflow trace part corresponding to a box on the boundary of two workflow parts, e.g., "Resliced Header 2." In cases where this assumption was not met, mitigation was required; see "Notes on the teams" below.

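Once the boundary Datasets were identified per-team, we attempted to merge all seven teams' traces by asserting, using owl:sameAs, that each team's boundary Dataset identifiers were equivalent to the corresponding terms in our abstract boundary Dataset vocabulary (e.g., "Resliced Image 3" and "Atlas Image").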

Implementation stage 3

After the owl:sameAs statements were added, each of the 343 possible combinations of workflow parts from the seven teams (one of seven teams for each of the three parts, 7^3 = 343) was merged automatically and a graphic was produced of each result. These graphics were inspected manually to debug errors in stage 2 processing.
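The enumeration of combinations is straightforward to reproduce. The following sketch assumes each team's part is available as a set of triples in a dictionary keyed by (team, part number); that dictionary and the `merge` helper are hypothetical and stand in for whatever storage the actual implementation used.

```python
from itertools import product

TEAMS = ["CESNET", "ES3", "Karma", "MINDSWAP", "MyGrid", "SDG", "VisTrails"]

# One team is chosen independently for each of the three workflow parts.
combinations = list(product(TEAMS, repeat=3))

def merge(parts, combination):
    """Union the triples of the chosen team's trace for each part.
    `parts` maps (team, part_number) to a set of triples."""
    merged = set()
    for part_number, team in enumerate(combination, start=1):
        merged |= parts[(team, part_number)]
    return merged
```

For example, the combination ("ES3", "SDG", "Karma") takes part 1 from ES3, part 2 from SDG, and part 3 from Karma.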

Implementation stage 4

In stage 4, query #1 was executed over all 343 integrated models. This was accomplished in the following way:

First, we asserted the equivalence of corresponding Steps in all seven models using owl:sameAs. For example, the identifiers that SDG and CESNET each use for the align_warp step both represent the procedure "1. align_warp" in the challenge workflow.

Then, for each of the 343 possible combinations of workflow parts:

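PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX t: <,2006:/2.0/pc2/>
CONSTRUCT { ?output t:dependsOn ?step }
WHERE { ?step rdf:type t:Step .
        ?step t:stepHasOutput ?output . }

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX t: <,2006:/2.0/pc2/>
CONSTRUCT { ?step t:dependsOn ?input }
WHERE { ?step rdf:type t:Step .
        ?step t:stepHasInput ?input . }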

Note that step three does not confirm that the intermediate products are also returned, but the dependency production rules guarantee it unless the workflow interpretations are malformed with respect to input/output relationships, and the manual inspection in stage 3 confirmed that they are not.
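The dependency production rules and a transitive lookup over them can be sketched in a few lines. This is an illustration only: it assumes triples are plain tuples with abbreviated names, that query 1 asks for everything the Atlas X Graphic depends on, and the `ex:` workflow below is a hypothetical three-step fragment rather than any team's trace.

```python
def derive_dependencies(triples):
    """Apply the two production rules: an output depends on the Step that
    produced it, and a Step depends on each of its inputs."""
    steps = {s for s, p, o in triples if p == "rdf:type" and o == "t:Step"}
    depends = set()
    for s, p, o in triples:
        if s in steps and p == "t:stepHasOutput":
            depends.add((o, s))
        elif s in steps and p == "t:stepHasInput":
            depends.add((s, o))
    return depends

def everything_upstream(depends, target):
    """Return all resources that `target` transitively depends on."""
    direct = {}
    for a, b in depends:
        direct.setdefault(a, set()).add(b)
    found, stack = set(), [target]
    while stack:
        node = stack.pop()
        for dep in direct.get(node, ()):
            if dep not in found:
                found.add(dep)
                stack.append(dep)
    return found

workflow = [
    ("ex:softmean", "rdf:type", "t:Step"),
    ("ex:softmean", "t:stepHasOutput", "ex:atlasImage"),
    ("ex:slicer", "rdf:type", "t:Step"),
    ("ex:slicer", "t:stepHasInput", "ex:atlasImage"),
    ("ex:slicer", "t:stepHasOutput", "ex:atlasXSlice"),
    ("ex:convert", "rdf:type", "t:Step"),
    ("ex:convert", "t:stepHasInput", "ex:atlasXSlice"),
    ("ex:convert", "t:stepHasOutput", "ex:atlasXGraphic"),
]
upstream = everything_upstream(derive_dependencies(workflow), "ex:atlasXGraphic")
```

Because every Step and Dataset on a path to the target is reached through alternating output and input rules, the intermediate products are included whenever the input/output relationships are well formed.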

Implementation code

The complete model integration implementation code is available at the following link. The archive contains the teams' data as well as the code used to process it.

* tupelo-pchal2.tgz (466K)

Building and executing the model integration implementation requires Java 1.5, Maven, and internet access (for downloading required libraries).

Notes on the teams

General notes

The naive interpretation and open-world assumptions generally fit fairly well with the workflow traces we selected. For instance, most teams identified Steps and Datasets. A completely-connected workflow description could be assembled from parts for all seven of the teams using our implementation strategy without developing alternative, team-specific strategies. Most of the work in integrating data came from figuring out what XSLT templates or Tupelo Transformers to apply to extract globally-consistent identifiers and descriptions of Steps, Datasets, and input/output relationships.

The naive interpretation and open-world assumptions proved problematic in many cases, especially with the XML traces. In particular, naively interpreting some of the traces produced graphs that differed in a number of details, e.g., whether the output of reslice was a single dataset (presumably representing both the image and its associated header file) or two. In other cases the workflow parts were not divided as described in the challenge specification, and in several cases additional "Steps" were detected that simply represented the operation of reading a file or combining more than two inputs together. These operations are implicit in the abstract challenge workflow specification (i.e., there are no rectangles representing them).

Some of the workflow traces contained challenge-workflow-specific terminology or data structures. For instance the MINDSWAP and Karma traces linked Steps to input and output datasets using terminology specific to the workflow step being executed (e.g., "hasInputSlice" and "WarpOutParamFile"), requiring that we assert our a priori knowledge about how those specific terms relate to our minimal vocabulary. Because such a priori knowledge is not likely to be available at scale, this type of data structure introduces significant interoperability and preservation risks.

RDF requires that every thing that RDF statements describe be identified with a globally-unique URI. Since XML has no such requirement, it was challenging to find information in XML traces that could be used to construct URIs for identifying Steps and Datasets. In many cases in the XML traces, Steps and/or Datasets were identified with strings that had no obvious uniqueness guarantees (e.g., small integers), and in those cases we assumed that the identifiers were locally scoped and typically added additional information and a URI prefix. We made no attempt to use consistent identifiers of Steps and Datasets across systems, because that would have required the consideration of a priori knowledge of the challenge workflow description.

One of the biggest problems we encountered in interpreting workflow traces was consistently identifying Steps and Datasets across workflow parts, since a number of the teams chose to partition the challenge workflow by having their system execute each part as a separate workflow, with no shared identifiers for Steps or Datasets. In some cases, to solve this problem we may have mistakenly used identifiers of classes (e.g., "any execution of X Slicer") instead of using identifiers of instances (e.g., "the execution of X Slicer in this workflow execution") and manually establishing correspondences between instances across workflow parts.

The following notes explain our interpretation of each team's workflow trace.


CESNET

Stage 1

CESNET's workflow trace is serialized in XML. Steps are identified with the XPath expression "/chal2:workflow/chal2:job/@id". Inputs and outputs are described using the "inputs" and "outputs" sub-elements of /chal2:workflow/chal2:job elements. Each input or output is named, and contains any number of "file" children, each of which contains any number of "url" children. We assumed that the "file" children represented the inputs and outputs of the job, respectively, but because the relationship between those "file" elements and the enclosed "url" elements was not specified (i.e., the same "url" element encloses each URL, regardless of whether it apparently represents e.g., an image or a header file) we assumed that the URN contained in the "name" attribute of each "file" element identifies a single input or output Dataset.

Since the URNs identifying Datasets are consistent across the CESNET files describing each workflow part, we did not need to assert any correspondences manually in order to reconstruct the complete workflow. It was also unnecessary to use any challenge-workflow-specific information in stage 1.
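The CESNET interpretation described above was implemented with XSLT; the same mapping can be sketched with Python's standard XML parser. Everything concrete in this sketch is an assumption made for illustration: the namespace URI, the URI prefix for Steps, the vocabulary namespace, and the sample document are all hypothetical, and only the element and attribute names ("job", "inputs", "outputs", "file", "url", "id", "name") come from the description above.

```python
import xml.etree.ElementTree as ET

CHAL2 = "http://example.org/chal2"              # hypothetical namespace URI
STEP_PREFIX = "http://example.org/cesnet/job/"  # hypothetical URI prefix for Steps
VOCAB = "http://example.org/pc2/"               # hypothetical stand-in for the minimal vocabulary
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

SAMPLE = """<chal2:workflow xmlns:chal2="http://example.org/chal2">
  <chal2:job id="job-7">
    <chal2:inputs name="warp">
      <chal2:file name="urn:example:warp1">
        <chal2:url>http://example.org/warp1.warp</chal2:url>
      </chal2:file>
    </chal2:inputs>
    <chal2:outputs name="resliced">
      <chal2:file name="urn:example:resliced1">
        <chal2:url>http://example.org/resliced1.img</chal2:url>
        <chal2:url>http://example.org/resliced1.hdr</chal2:url>
      </chal2:file>
    </chal2:outputs>
  </chal2:job>
</chal2:workflow>"""

def cesnet_to_ntriples(xml_text):
    """Emit one N-Triples line per Step and per input/output relationship.
    Each "file" element is treated as one Dataset, identified by the URN in
    its "name" attribute; the enclosed "url" elements are ignored."""
    root = ET.fromstring(xml_text)
    lines = []
    for job in root.findall("{%s}job" % CHAL2):
        step = STEP_PREFIX + job.get("id")
        lines.append("<%s> <%s> <%sStep> ." % (step, RDF_TYPE, VOCAB))
        for container, predicate in (("inputs", "stepHasInput"),
                                     ("outputs", "stepHasOutput")):
            for group in job.findall("{%s}%s" % (CHAL2, container)):
                for file_el in group.iter("{%s}file" % CHAL2):
                    lines.append("<%s> <%s%s> <%s> ." % (
                        step, VOCAB, predicate, file_el.get("name")))
    return lines

ntriples = cesnet_to_ntriples(SAMPLE)
```

Note that the output "file" in the sample carries two URLs (image and header) but yields a single Dataset, which mirrors the interpretation choice described above.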

Stage 2

In stage 2, we were faced with the problem that our stage 1 interpretation identified only half of the boundary Datasets required, since we could not naively interpret the "url" elements as Dataset identifiers. We therefore asserted equivalences between CESNET's resliced image and atlas image Datasets and the image Datasets in our abstract boundary Dataset vocabulary (e.g., "Resliced Image 3" and "Atlas Image") on the assumption that header files were so closely associated with image files that we could ignore the header files and still arrive at a consistent model integration. In the case of the atlas image, because it is the only output of softmean and the only input of slicer, no manual disambiguation was required. In the case of resliced images, we looked at the first workflow trace and assumed that the Dataset identifier containing the term "anatomy1" was Resliced Image 1, the Dataset identifier containing the term "anatomy2" was Resliced Image 2, etc.


ES3

Stage 1

ES3's workflow trace is serialized in XML. Unlike other traces, ES3's trace is organized structurally around relationships between Steps and Datasets, rather than assigning either Steps or Datasets structural priority (i.e., by placing the elements representing them higher in the XML tree). ES3 models these relationships as non-commutative binary relationships between "transformations" (i.e., Steps) and "files" (i.e., Datasets). This made it easy to identify input and output relationships, but the XML structure contains a large amount of redundant information because the source and target of a link are redescribed for each link the source and target participate in. For instance, "/ES3response/link[1]/from/object" in getLineage-part1.xml has the same contents as "/ES3response/link[2]/from/object". A number of complex XPath expressions were required to eliminate redundant information from the output.

Steps are easily identified in ES3 traces by UUID, for instance at "/ES3response/link[1]/from/object/uuid" in getLineage-part1.xml. Since no steps occur in more than one file, this identification scheme does not cause problems in stage 2. Datasets are not so easily identified. Even though a "localId" is provided which matches across workflow part trace files, because the local ID is not globally unique we chose to interpret the "md5sum" element as the identifier of the Dataset, since presumably it will not vary if the dataset is moved from one location to another. Alternatively, we could have prepended a URI prefix to the local ID.

Stage 2

All ten boundary Datasets are identified in ES3 data, so the only disambiguation required was between resliced images/headers 1-4. We assumed that the digits in the localId fragments in the form "...resliced[1-4].img" and "...resliced[1-4].hdr" identified which of the four image/headers each Dataset represented.
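The digit-based disambiguation can be expressed as a single regular expression. This sketch assumes the localId ends with the file name; the example path in the usage note is hypothetical.

```python
import re

# Matches file names such as "resliced3.img" or "resliced3.hdr".
RESLICED = re.compile(r"resliced([1-4])\.(img|hdr)$")

def boundary_name(local_id):
    """Map a localId to an abstract boundary Dataset name, or None."""
    match = RESLICED.search(local_id)
    if match is None:
        return None
    kind = "Image" if match.group(2) == "img" else "Header"
    return "Resliced %s %s" % (kind, match.group(1))
```

For instance, a localId ending in "resliced3.hdr" maps to "Resliced Header 3", while identifiers that do not match the pattern are left for other boundary Datasets.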


Karma

Stage 1

Karma's workflow traces are represented in XML, incorporating a number of standard XML serializations from SOA-type systems. Accordingly, the traces focus on service execution, with inputs and outputs being described as structurally secondary, largely as parameterizations of service invocations and contents of messages sent and received by services.

There is no single XML data structure in Karma trace data that completely describes a Step or Dataset. Instead, events in the lifecycle of a service are described, and the existence of Steps and Datasets can be inferred from those event descriptions. In the case of Steps, wor:serviceInvoked elements contain a number of structures that uniquely identify Steps, of which we chose the wor:serviceInvoked/wor:notificationSource/@wor:workflowNodeID attribute as the Step identifier, qualifying it with an arbitrarily-chosen URI prefix.

Datasets are more problematic in Karma, because the role of files as inputs or outputs to Steps is implicit in service-specific parameter descriptions such as the ones found in "/activityDump/wor:serviceInvoked/wor:request/wor:body/S:Body" elements. Interpreting these parameters as the identifiers of input and output Datasets required challenge-workflow-specific XSLT templates. Once the correct elements were identified, their contents were all found to be URNs (e.g., "urn:leadproject-org:data:702cefe6-4230-496f-8cb2-6895a92c2436" identifies "Resliced Image 1"), which we used unmodified as Dataset identifiers.

Stage 2

At stage 2 we had the ten boundary Dataset identifiers in hand and we manually correlated them to the abstract challenge workflow by reading the digits from the names of the XML elements containing them (e.g., "ReslicedHeaderFile3").


MINDSWAP

Stage 1

MINDSWAP's execution traces are serialized in RDF/XML. All Datasets and Steps are identified with URIs, and input/output relationships are identified with challenge-workflow-specific predicates. The presence of challenge-workflow-specific terms in MINDSWAP's ontology made the input/output relationships more difficult to interpret than they otherwise would have been, since rules were required for each type of processing step (e.g., align_warp, slicer) in order to infer the generic input/output relationships expressible in our vocabulary.

There were also errors in MINDSWAP's execution trace. Specifically, the inputs to softmean contain duplicate triples and incorrectly identify headers as input images. We solved these problems after consulting with MINDSWAP by manually making a partial correction to the workflow trace data.

Stage 2

Unlike other trace data, the identifiers and data structures immediately surrounding the Dataset identifiers on the boundary between workflow parts 1 and 2 (e.g., "Resliced Header 3") in MINDSWAP traces do not contain any clue about which of the four input anatomy images the Dataset was derived from. In order to disambiguate them, we traced the execution back to the align_warp stage and read the digits from the relatively straightforward identifiers of that stage's input (e.g., "challenge/anatomy3.img").


MyGrid

Stage 1

MyGrid's workflow traces are represented in RDF/XML. Steps and Datasets are typed subjects, although we used the values of the #runsProcess and #outputDataHasName properties to establish equivalences between instances of the "same" Step or Dataset across workflow parts.

Stage 2

Again, disambiguating reslice outputs was made easy in this case because the identifiers we selected included telltale digits.


SDG

Stage 1

SDG's workflow traces are represented in RDF/XML. Vocabulary mapping alone was almost enough to complete stage 1. The following correspondences were used:

In the case of stepHasInput, a rule had to be used because SDG's ontology uses the sdg:isInput predicate whose subject is the Dataset and whose object is the Step.
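The inversion rule for sdg:isInput amounts to swapping subject and object while renaming the predicate. A minimal sketch, assuming triples are plain tuples with abbreviated names and using hypothetical `ex:` identifiers in the usage note:

```python
def invert_is_input(triples):
    """Rewrite (dataset, sdg:isInput, step) as (step, t:stepHasInput, dataset);
    all other triples pass through unchanged."""
    rewritten = []
    for s, p, o in triples:
        if p == "sdg:isInput":
            rewritten.append((o, "t:stepHasInput", s))
        else:
            rewritten.append((s, p, o))
    return rewritten
```

Given ("ex:atlasImage", "sdg:isInput", "ex:slicer"), the rule yields ("ex:slicer", "t:stepHasInput", "ex:atlasImage"). A pure vocabulary mapping cannot express this because it renames predicates without changing the direction of the relationship.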

Stage 2

The selected Dataset identifiers contained the digits necessary for manual disambiguation of the reslice outputs.


VisTrails

VisTrails was the most difficult of the seven traces to interpret, because its representation of the workflow cannot readily be identified with the challenge workflow without significant a priori knowledge of the challenge workflow structure. Rather than rely on such knowledge, we chose to see how far we could get while using as little of it as possible.

Stage 1

VisTrails's workflow traces are represented in XML. Unlike the other six traces we selected, VisTrails's workflow model does not appear to be reconcilable with our assumption that all the traces would contain descriptions of the Steps and Datasets identified in the challenge workflow specification. In the VisTrails model, workflow descriptions consist entirely of "modules" and "connections" between modules. Connections appear to be binary and ordered by their "port/@type" attributes (the values of which are either "source" or "destination"). Some modules appear (to a human reader) to represent Datasets (e.g., "AIRHeaderFile") and some appear to represent Steps (e.g., "Reslice"); these descriptive terms are found in "/workflow/module/@name" attributes. Some appear to represent neither Steps nor Datasets but rather structural relationships involving combining or splitting data (e.g., "List6").

Given that the VisTrails notion of "module" is insufficient to disambiguate these three classes of uses, we could have chosen to use module names as evidence to interpret some modules as Steps, some as Datasets, and some as neither. But even doing that would not have allowed us to reconstruct the challenge workflow, since for instance there is no module that appears to represent the output of an align_warp step. Instead, we attempted a naive interpretation based on incorrect but challenge-workflow-neutral assumptions: that every module represents a Step, and that every connection between two modules implies a Dataset passed from the source module to the destination module.

Even though these are incorrect assumptions, they enabled us to reconstruct something resembling the challenge workflow description, albeit with a large number of "extra" Steps and a chain of "List" Steps that appear to merge reslice outputs two at a time into a single input for softmean.

Identifying Steps was made difficult because there is no single string anywhere in the VisTrails trace that uniquely identifies a module. Instead, modules are identified with small integers that are only unique per-workflow. Workflows are also identified (elsewhere) with relatively small integers, and appear to come with no uniqueness guarantee or namespace scoping mechanism. To work around these problems, we constructed globally unique IDs for modules by assuming that workflow IDs were globally unique, prepending them (with a separator) to module IDs, and prefixing that string with an arbitrarily-selected URI prefix. Identifying the Datasets that were implied in our interpretation was accomplished by assembling an identifier from the identifiers of the Steps involved in the connection that implied the Dataset. This achieved uniqueness, but because VisTrails modeled the three workflow parts as separate workflows, the identifiers we constructed did not allow us to connect the parts together. To solve this problem, we had to add correspondences manually based on our a priori knowledge of the challenge workflow structure.
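The identifier construction described above can be sketched as two small functions. The URI prefix, the separators, and the "module"/"data" path segments are hypothetical choices made for this illustration; the original implementation's exact scheme is not stated here.

```python
PREFIX = "http://example.org/vistrails/"  # arbitrarily selected, hypothetical

def module_uri(workflow_id, module_id):
    """Scope a per-workflow module ID by its workflow ID and a URI prefix."""
    return "%smodule/%s/%s" % (PREFIX, workflow_id, module_id)

def dataset_uri(workflow_id, source_module_id, destination_module_id):
    """Name the Dataset implied by a connection after the two modules
    (interpreted as Steps) that the connection joins."""
    return "%sdata/%s/%s-%s" % (
        PREFIX, workflow_id, source_module_id, destination_module_id)
```

Because the workflow ID is part of every identifier, the same module number in two workflow parts yields two different URIs. That gives uniqueness but is also why the parts could not be joined without manually added correspondences.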

Stage 2

Because of our naive interpretation of VisTrails, it was difficult to decide which Datasets in our interpretation corresponded to boundary Datasets in the challenge workflow description. We selected the datasets closest to the "edges" of each workflow part because any other choice would have required the use of a priori knowledge of the challenge workflow. Disambiguating the four outputs of reslice was accomplished by manually inspecting the "/workflow/module/function[@name='outputBaseName']/parameter/@val" attributes. As a result of our selection strategy, the merged workflow parts interpenetrate a bit, with "dangling" Steps that appear redundant. Nevertheless, the result of stage 2 enabled VisTrails to be merged with all six other models into a completely-connected, directed workflow.


Recommendations

From an interoperability and digital preservation perspective, a number of practices used in the workflow traces we examined create significant problems. These practices are described below along with recommendations of alternative techniques that mitigate some of the problems created.

Use explicitly-scoped identifiers

Locally-scoped identifiers are often employed in XML data, typically as a means of working around XML's strictly hierarchical Document Object Model. Unfortunately, such identifiers do not provide uniqueness guarantees that extend beyond a single document. The challenge description's requirement that workflow descriptions be separated into parts caused this approach to "break" for a number of systems, and at-scale integration is likely to create the same problem, because real work processes often span distributed, heterogeneous environments and cannot practically be described as a single workflow trace. In addition, in most of the cases in which locally-scoped identifiers were used, no standard identification schemes with known scoping guarantees were used, making it impossible to determine in what contexts the ID could be used unmodified. As a result, the IDs typically have to be modified or mapped in non-standard ways in order to be reused outside the document in which they originally occur, which introduces significant scaling and preservation risks.

Explicitly-scoped identification schemes (e.g., URIs, QNames, UUIDs, X509 Distinguished Names) solve most of these problems.

Use application-non-specific data structures

At scale, systems must be able to capture and interpret data about any workflow they are capable of running without requiring workflow-specific configuration beyond the specification of the workflow itself. Several of the systems we investigated (Karma, MINDSWAP, VisTrails) do not meet this requirement because they either use challenge-workflow-specific data structures in their schemas/ontologies or use data structures and vocabularies that are ambiguous without considering a priori knowledge of the challenge workflow.

For instance, VisTrails characterizes every process step and data item as a "module," making it impossible to disambiguate process steps and data items without taking into account the uncontrolled vocabulary of challenge-workflow-specific module names employed in the workflow trace (e.g., "AlignWarp", "AIRHeaderFile".) In MINDSWAP's case, it seems plausible based on our interpretation of it that its challenge-specific ontology could be generalized with straightforward OWL constructs for associating classes and properties with more general super-classes and super-properties.

In our view this problem is best addressed with ontologies that explicitly relate workflow-specific classes and properties with the more general classes and properties they represent. Alternatively, systems can be designed to record only general descriptions, which CESNET, ES3, MyGrid, and SDG largely do.
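The first recommendation can be illustrated with a simplified form of sub-property inference, in the spirit of rdfs:subPropertyOf. This is a sketch under stated assumptions: each property has at most one declared super-property (RDFS allows several), triples are plain tuples, and the `ex:` prefix and sample identifiers are hypothetical. Only the term "hasInputSlice" is taken from the traces discussed above.

```python
def generalize(triples, sub_property_of):
    """Add, for every triple, the triples implied by walking up the
    declared super-property chain of its predicate."""
    result = set(triples)
    for s, p, o in triples:
        visited = set()
        while p in sub_property_of and p not in visited:
            visited.add(p)  # guards against cyclic declarations
            p = sub_property_of[p]
            result.add((s, p, o))
    return result

# A workflow-specific predicate declared as a kind of generic input.
hierarchy = {"ex:hasInputSlice": "t:stepHasInput"}
specific = [("ex:convert1", "ex:hasInputSlice", "ex:atlasXSlice")]
general = generalize(specific, hierarchy)
```

With such declarations published alongside the trace, a consumer that knows only the general vocabulary can still find the input relationship without a priori knowledge of the workflow-specific term.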

Do not store structured information in unstructured containers

Semi-structured formats such as markup languages are designed to contextualize unstructured data. In practice, many XML formats embed structured data in PCDATA sections using ad-hoc and/or non-standard formats. This makes the metadata embedded in those structures impossible to interpret without human intervention or a priori knowledge of the non-standard formats employed. Even when a standard format is employed, it cannot normally be parsed by standard XML tooling such as XSLT and XQuery processors, making it relatively difficult to extract.

For example, the inputs of align_warp comprise header and image files. In most of the workflow traces we investigated, this classification was not explicitly stated in the information except in the strings used to identify or name those inputs (e.g., "AnatomyHeader3"). In some cases, for instance CESNET, there is no XML data structure that can be used to determine whether each of these inputs is a header or image. In Karma, on the other hand, each input or output parameter is explicitly identified, which unfortunately also makes it impossible to determine if any of them belong to the same class (which they do: "alignwarpservice_runtypens:AnatomyImageFile" and "alignwarpservice_runtypens:ReferenceImageFile" both represent images.)

In our view, data structures whose unstructured sections contain metadata in ad-hoc or non-standard formats are underspecified, risking the loss or misinterpretation of that metadata by downstream processors. This problem can be solved by refactoring the data structures so that the structural relationships embedded in unstructured data are represented as first-class data structures, e.g., XML elements and attributes.

-- JoeFutrelle - 31 Jul 2007

Attachment         Size      Date                 Who          Comment
d2k.xml            70.5 K    22 Feb 2007 - 20:43  JoeFutrelle  Entire D2K provenance trace in RDF/XML format
cipc.xml           58.5 K    22 Feb 2007 - 20:29  JoeFutrelle  Entire CyberIntegrator trace in RDF/XML format
d2k_part1.xml      40.4 K    22 Feb 2007 - 21:21  JoeFutrelle  Part 1 of the D2K execution trace, in RDF/XML format
d2k_part2.xml      24.9 K    22 Feb 2007 - 21:21  JoeFutrelle  Part 2 of the D2K execution trace, in RDF/XML format
d2k_part3.xml      31.0 K    22 Feb 2007 - 21:22  JoeFutrelle  Part 3 of the D2K execution trace, in RDF/XML format
cipc_part1.xml     36.8 K    23 Feb 2007 - 19:37  JoeFutrelle  Part 1 of the CyberIntegrator execution trace in RDF/XML format
cipc_part2.xml     28.2 K    23 Feb 2007 - 19:37  JoeFutrelle  Part 2 of the CyberIntegrator execution trace in RDF/XML format
cipc_part3.xml     32.7 K    23 Feb 2007 - 19:38  JoeFutrelle  Part 3 of the CyberIntegrator execution trace in RDF/XML format
tupelo-pchal2.tgz  466.7 K   31 Jul 2007 - 17:27  JoeFutrelle  Model integration implementation
pchal2slides.pdf   1542.0 K  31 Jul 2007 - 17:51  JoeFutrelle  Workshop slides


Copyright © 1999-2012 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.