
Provenance Challenge




PowerPoint presentation

Participating Team

Workflow Representation

ES3 executes the challenge workflow shell script directly, without any modification.

The corresponding workflow representation is assembled post hoc (as described below) by ES3, and is retrieved from ES3 as a GraphML document. The workflow diagrams in this report were generated by yWorks' yEd graph Editor, using reformatted ES3 GraphML documents as input. Files are represented as circles and transformations as squares. Process arguments are omitted to minimize clutter.

Provenance Trace

Provenance in ES3 is managed by two components: the Probulator and the ES3 Core.

Unlike its namesake, the ES3 Probulator is designed to non-intrusively monitor the execution of complex scientific applications. All operations of the Probulator are completely transparent to ES3 users, and the default mode of operation requires no modification whatsoever of existing codes.

The Probulator comprises two applications, the Logger and the Transmitter. The Logger automatically instruments, monitors, and logs the execution of targeted programs and their interactions with their environment (files, parameters, system calls, etc.). A family of plugins adapts the Logger to different scientific processing environments. Currently two plugins are provided:

  1. The default plugin uses system call tracing to intercept and log a subset of the probulated process's system calls. This plugin currently works on Linux, and should work on any UNIX-like system that supports the "strace" facility.
  2. A plugin for the IDL analysis environment preprocesses IDL scripts to insert ES3-specific logging information, and to replace calls to certain IDL built-in functions with calls to instrumented ES3 equivalents. Although this plugin does modify the targeted application code, it does so transparently and reversibly -- no user intervention is required beyond setting a flag in an environment variable to enable or disable probulation.
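
As an illustration of the system-call-tracing approach used by the default plugin, here is a minimal sketch (not the ES3 Logger itself) that extracts file accesses from lines in the format strace typically prints for open and openat calls. The regular expression, the function name parse_strace_line, and the returned record layout are assumptions made for this example.

```python
import re

# Matches strace output such as:
#   openat(AT_FDCWD, "anatomy1.img", O_RDONLY) = 3
#   [pid 4242] open("warp1.warp", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 4
OPEN_RE = re.compile(
    r'^(?:\[pid\s+(?P<pid1>\d+)\]\s+|(?P<pid2>\d+)\s+)?'
    r'(?P<call>open|openat)\((?:AT_FDCWD,\s*)?"(?P<path>[^"]*)",\s*'
    r'(?P<flags>[A-Z_|]+).*\)\s*=\s*(?P<ret>-?\d+)'
)


def parse_strace_line(line):
    """Return a file-access record, or None for other calls and failed opens."""
    m = OPEN_RE.match(line.strip())
    if m is None or int(m.group("ret")) < 0:
        return None
    flags = m.group("flags").split("|")
    # Anything opened for writing is a candidate output; the rest are inputs.
    mode = "write" if ("O_WRONLY" in flags or "O_RDWR" in flags) else "read"
    pid = m.group("pid1") or m.group("pid2")
    return {
        "pid": int(pid) if pid else None,
        "path": m.group("path"),
        "mode": mode,
    }
```

Records with mode "read" would correspond to a process's inputs and records with mode "write" to its outputs; a real logger would also need to track process creation and file descriptors, which this sketch omits.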

Upon termination of a Logger session (or on specific request), Logger log files are read by the Transmitter, which:

  1. assigns a universally unique identifier (UUID) to every provenance-relevant object (file or process) referenced in the log file;
  2. converts the plugin-specific log files into standard ES3 execution reports; and
  3. sends these reports as XML messages via a web service interface to the ES3 Core.
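
The first two Transmitter steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the ES3 report format: the element and attribute names (executionReport, transformation, input, output, argument) and the function build_execution_report are hypothetical, and the sketch treats each file name as a single object. Sending the XML to the Core's web service is left out.

```python
import uuid
import xml.etree.ElementTree as ET


def build_execution_report(process_name, args, inputs, outputs, known=None):
    """Build one XML report; `known` maps file names to already-assigned UUIDs."""
    known = {} if known is None else known

    def uid(name):
        # Reuse the UUID of a file seen before so that reports link up.
        if name not in known:
            known[name] = str(uuid.uuid4())
        return known[name]

    report = ET.Element("executionReport")  # hypothetical element names
    proc = ET.SubElement(
        report, "transformation", uuid=str(uuid.uuid4()), name=process_name
    )
    for arg in args:
        ET.SubElement(proc, "argument").text = arg
    for path in inputs:
        ET.SubElement(proc, "input", uuid=uid(path), name=path)
    for path in outputs:
        ET.SubElement(proc, "output", uuid=uid(path), name=path)
    return ET.tostring(report, encoding="unicode"), known
```

Passing the returned mapping into the next call lets the output of one process and the input of the next carry the same UUID, which is what allows the reports to be joined into a graph later.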

The ES3 Core decomposes the execution reports into object references and linkages between objects, using the Transmitter-supplied UUIDs as primary keys. This allows the Core to reconstruct the provenance graph from arbitrary starting points, forward and backward in time, by following the UUID references. The Core can also use file name, process name, and argument information captured by the Probulator to map between UUIDs and external names, allowing ES3 users to form queries in terms of objects they're familiar with.
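
The idea of storing objects and linkages keyed by UUID, then walking them in either direction, can be sketched as below. This is a simplified in-memory model, not the ES3 Core's actual storage; the class ProvenanceStore and its method names are assumptions, and short strings stand in for real UUIDs.

```python
from collections import defaultdict


class ProvenanceStore:
    """Toy model: objects keyed by UUID, with links stored in both directions."""

    def __init__(self):
        self.names = {}                     # UUID -> external name
        self.downstream = defaultdict(set)  # UUID -> UUIDs derived from it
        self.upstream = defaultdict(set)    # UUID -> UUIDs it was derived from

    def add_report(self, proc, inputs, outputs):
        """proc and every item of inputs/outputs is a (uuid, name) pair."""
        for uid, name in [proc, *inputs, *outputs]:
            self.names[uid] = name
        for uid, _ in inputs:
            self._link(uid, proc[0])   # input file -> process
        for uid, _ in outputs:
            self._link(proc[0], uid)   # process -> output file

    def _link(self, src, dst):
        self.downstream[src].add(dst)
        self.upstream[dst].add(src)

    def find(self, name):
        """Map an external name back to the UUID(s) recorded under it."""
        return [uid for uid, n in self.names.items() if n == name]

    def trace(self, start, backward=True):
        """Return every UUID reachable from start in the chosen direction."""
        edges = self.upstream if backward else self.downstream
        seen, stack = set(), [start]
        while stack:
            for nxt in edges.get(stack.pop(), ()):
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return seen
```

A query phrased in terms of a file name would first call find to obtain a UUID and then call trace from it.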

Example Provenance Trace for

  1. User installs the Probulator and sets an environment variable to activate tracing
  2. User runs
  3. Logger writes log file to disk
  4. Transmitter processes log file and sends execution report to ES3 Core
  5. ES3 Core stores execution report in its database
  6. User requests provenance information for (or for the UUID under which the workflow was submitted)
  7. ES3 Core returns provenance information (in GraphML or ES3 XML format)
    1. If necessary, provenance report is post-processed for input to display tool
  8. Display tool (e.g. yEd) creates workflow DAG

Provenance Queries

| Q1 | Q2 | Q3 | Q4 | Q5 | Q6 | Q7 | Q8 | Q9 |
| thumbs up | thumbs up | thumbs up | thumbs up + frown | smile | thumbs up | thumbs up | smile | smile |

Query 1

  1. Find UUID for object named "Atlas X Graphic".
  2. Trace lineage backwards from corresponding UUID
  3. Display results

Query 2

  1. Find UUID for object named "Atlas X Graphic".
  2. Trace lineage backwards from corresponding UUID until object named "softmean" is encountered
  3. Display results

Query 3

  1. Find UUID for object named "Atlas X Graphic".
  2. Trace lineage backwards 5 links from corresponding UUID
  3. Display results


The ES3 Core data model doesn't include a concept of workflow "stages". For this query we simply traced back five links (our interpretation of "Stages 3, 4, and 5" in the challenge workflow) from the "Atlas X Graphic" object. The lineage trace query uses a termination condition that states the trace should end after traversing five links from the starting UUID.
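
Queries 1 to 3 differ only in their termination condition, which the following sketch makes explicit. It is not the ES3 Core query itself: the function trace_back, its parameters stop_at and max_links, and the UPSTREAM table (a simplified, single-branch rendering of the tail of the challenge workflow, with names standing in for UUIDs) are assumptions made for illustration.

```python
from collections import deque

# Each object maps to the objects it was directly derived from.
UPSTREAM = {
    "Atlas X Graphic": ["convert"],
    "convert": ["Atlas X Slice"],
    "Atlas X Slice": ["slicer"],
    "slicer": ["Atlas Image"],
    "Atlas Image": ["softmean"],
    "softmean": ["Resliced Image 1", "Resliced Image 2"],
    "Resliced Image 1": ["reslice 1"],
    "Resliced Image 2": ["reslice 2"],
}


def trace_back(start, upstream, stop_at=None, max_links=None):
    """Breadth-first backward trace; returns {object: links from start}."""
    depth = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if stop_at is not None and node == stop_at:
            continue  # Query 2: include the named object, go no further
        if max_links is not None and depth[node] >= max_links:
            continue  # Query 3: stop after the given number of links
        for parent in upstream.get(node, []):
            if parent not in depth:
                depth[parent] = depth[node] + 1
                queue.append(parent)
    return depth
```

With this table, trace_back("Atlas X Graphic", UPSTREAM) corresponds to Query 1, adding stop_at="softmean" to Query 2, and adding max_links=5 to Query 3. In this simplified graph the last two return the same objects, since softmean lies exactly five links back.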

Query 4

  1. Find all ES3 transformation objects (i.e. processes) that have the specified name and command line arguments


The split score (thumbs up + frown) for this query is due to XQuery's lack of support for queries based on day-of-week.
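
One way to cover the day-of-week condition outside XQuery would be a client-side post-filter over the returned transformation records. The sketch below is an assumption about how that could look, not something ES3 provides: the record fields name, args, and started (an ISO 8601 timestamp) and both function names are hypothetical.

```python
from datetime import datetime


def has_option(args, flag, value):
    """True if `flag` is immediately followed by `value` in the argument list."""
    return any(a == flag and b == value for a, b in zip(args, args[1:]))


def monday_runs(records, name, flag, value):
    """Keep records matching name and arguments that started on a Monday."""
    return [
        r for r in records
        if r["name"] == name
        and has_option(r["args"], flag, value)
        # datetime.weekday() numbers Monday as 0.
        and datetime.fromisoformat(r["started"]).weekday() == 0
    ]
```

The name and argument conditions mirror the step listed for this query; only the weekday test is added on the client side.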

Query 5

We did not implement Query 5, since the ES3 Probulator currently doesn't examine the contents of the objects it monitors. (See Further Comments below.)

Query 6

  1. Retrieve all align_warp transformations with arguments -m 12
  2. Trace lineage forward to softmean
  3. Retrieve file objects one lineage step forward from softmean
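
The three steps above can be sketched as a forward trace over an in-memory graph. This is an illustration only: the processes and downstream tables, the identifiers in them, and the function names are assumptions, with short strings standing in for UUIDs.

```python
def has_option(args, flag, value):
    """True if `flag` is immediately followed by `value` in the argument list."""
    return any(a == flag and b == value for a, b in zip(args, args[1:]))


def reachable(start, downstream):
    """All objects reachable by following lineage links forward from start."""
    seen, stack = set(), [start]
    while stack:
        for nxt in downstream.get(stack.pop(), []):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def softmean_outputs(processes, downstream):
    """processes: {id: (name, args)}; downstream: {id: [ids derived from it]}."""
    # Step 1: align_warp transformations run with -m 12.
    starts = [
        pid for pid, (name, args) in processes.items()
        if name == "align_warp" and has_option(args, "-m", "12")
    ]
    results = set()
    for start in starts:
        # Step 2: follow lineage forward until a softmean process is found.
        for node in reachable(start, downstream):
            if node in processes and processes[node][0] == "softmean":
                # Step 3: the files one link forward from softmean.
                results.update(downstream.get(node, []))
    return results
```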

Query 7

  1. is modified as instructed.
    • We allow the added programs to communicate via pipes (as opposed to intermediate files).
    • We supply arbitrary arguments for pgmtoppm.
  2. The modified workflow ( is probulated, saved in ES3, and retrieved as GraphML
  3. We use a simple home-brewed graph differencing tool to flag the differences between the original and modified graphs on a per-element basis (ignoring UUIDs) with a diff=[true|false] attribute.
  4. The flagged graphs are rendered, with differing portions marked by (in this example) red dashed lines.


Our solution to Query 7, while not implemented entirely as an ES3 Core query, is nevertheless responsive to one of the primary classes of user queries that ES3 as a whole was designed to support; namely, "what changed?" queries. It's extremely common for scientists developing ad hoc workflows to notice differences in outputs across invocations between which "nothing was changed". Our graph-differencing approach is designed to answer the "what changed?" query as directly (and visually) as possible, while still allowing subsequent drill-down into the details.
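
A per-element graph difference that ignores UUIDs can be sketched as follows. This is not the differencing tool described above, only a minimal illustration of the idea: elements are compared by label rather than by UUID, labels are assumed to be unique within each graph, and the function name flag_differences and its data layout are hypothetical. The returned booleans play the role of the diff attribute.

```python
def flag_differences(nodes_a, edges_a, nodes_b, edges_b):
    """Flag the elements of graph B that have no counterpart in graph A.

    nodes_*: {uuid: label}; edges_*: set of (source_uuid, target_uuid).
    Returns (node_flags, edge_flags), keyed by label and by label pair.
    """
    labels_a = set(nodes_a.values())
    # Rewrite A's edges in terms of labels so that UUIDs drop out.
    labelled_edges_a = {(nodes_a[s], nodes_a[t]) for s, t in edges_a}

    node_flags = {label: label not in labels_a for label in nodes_b.values()}
    edge_flags = {}
    for s, t in edges_b:
        edge = (nodes_b[s], nodes_b[t])
        edge_flags[edge] = edge not in labelled_edges_a
    return node_flags, edge_flags
```

Calling the function a second time with the two graphs swapped flags the elements that exist only in the original graph, so both renderings can be marked.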

Queries 8 and 9

We did not implement Queries 8 and 9, since the ES3 Core currently doesn't support annotations. (See Further Comments below.)

Further Comments

ES3's provenance management currently concentrates on the automatic, transparent acquisition of structural provenance; i.e., reverse-engineering the workflow. Nothing prevents one from storing in ES3 the additional content-based information required by Queries 5, 8, and 9; however, we have not yet implemented a way to "slipstream" this information into the Probulator logs or Transmitter messages while remaining unobtrusive to the ES3 user. This is definitely within ES3's scope, which is why we've scored these queries smile, and it is the part of ES3 currently under development.

-- JamesFrew - 12 Sep 2006


Attachment: Presentation.ppt (853.0 K, 13 Sep 2006 - 17:44, JamesFrew)


Copyright © 1999-2012 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.