ES3
Participating Team
Differences from First Challenge
ES3 lineage trace schema
Data Model
The data model for ES3 contains only four types of objects: 1) files, 2) data transformations, 3) links, and 4) workflows.
File objects in ES3 represent files on disk that are read from or written to during the execution of the workflow. File objects
may be data files that are manipulated directly by the workflow or may be files that are read and written by the executables used
by the workflow, including operating system libraries, directories and temporary files.
File information may be filtered using a configuration file before it is sent to ES3, so that files the workflow uses but that are of no interest to the investigator, such as system libraries or temporary files, are ignored.
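The filtering step can be pictured with a minimal sketch. It assumes glob-style exclusion patterns; the pattern list and the function names are hypothetical and do not reflect the actual ES3 configuration file format.

```python
from fnmatch import fnmatch

# Hypothetical exclusion patterns; the real ES3 configuration format is not shown here.
EXCLUDE_PATTERNS = ["/lib/*", "/usr/lib/*", "/tmp/*", "*.so", "*.so.*"]


def is_of_interest(path, exclude_patterns=EXCLUDE_PATTERNS):
    """Return True if the path matches none of the exclusion patterns."""
    return not any(fnmatch(path, pattern) for pattern in exclude_patterns)


def filter_files(paths, exclude_patterns=EXCLUDE_PATTERNS):
    """Keep only the file paths that should be reported."""
    return [p for p in paths if is_of_interest(p, exclude_patterns)]
```

With these patterns, a system library such as /lib/libc.so.6 and a scratch file under /tmp are dropped, while a data file such as /data/atlas.hdr is kept.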
Data transformation objects are executable scripts or programs that are run during the execution of the workflow.
Link objects represent the connections between ES3 objects, for example between file objects and transformation objects.
A link has a single direction, so each link defines a 'source' object and a 'destination' object. Links must be organized
as a Directed Acyclic Graph, such that no links point backward in the graph to create a loop.
A workflow object is the container in which all file, transformation and link objects belong. The workflow object represents
all objects that are used during an instance of scientific processing that begins when recording for a Unix process begins
and ends when that process exits.
Workflows can be connected to each other implicitly via files that one workflow writes and another workflow reads. No explicit connection is
created between workflows.
A workflow may contain another workflow thereby creating a nested structure.
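The four object types described above can be sketched as plain Python classes. This is an illustration only, not the ES3 schema: the class names, the fields, and the cycle check in add_link are assumptions made for the example.

```python
from dataclasses import dataclass, field
from uuid import uuid4


@dataclass
class ES3Object:
    name: str
    uuid: str = field(default_factory=lambda: str(uuid4()))


@dataclass
class FileObject(ES3Object):
    pass


@dataclass
class Transformation(ES3Object):
    pass


@dataclass
class Link:
    source: str       # UUID of the source object
    destination: str  # UUID of the destination object


@dataclass
class Workflow(ES3Object):
    # A Workflow is itself an ES3Object, so it can be added to another workflow.
    objects: dict = field(default_factory=dict)  # uuid -> object
    links: list = field(default_factory=list)

    def add(self, obj):
        self.objects[obj.uuid] = obj
        return obj

    def _reachable(self, start, target):
        """Depth-first search: can target be reached from start via links?"""
        stack, seen = [start], set()
        while stack:
            node = stack.pop()
            if node == target:
                return True
            if node in seen:
                continue
            seen.add(node)
            stack.extend(l.destination for l in self.links if l.source == node)
        return False

    def add_link(self, source, destination):
        # Reject a link that would close a loop, keeping the graph acyclic.
        if self._reachable(destination.uuid, source.uuid):
            raise ValueError("link would create a cycle")
        self.links.append(Link(source.uuid, destination.uuid))
```

Links refer to objects by UUID rather than by reference, which mirrors the way the queries later on this page start from a UUID.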
Provenance Data for Workflow Parts
- original workflow
- workflow from query 7 variation (only the third part is included as part 1 and part 2 are unchanged.)
Model Integration Results
We imported provenance data from the PASS system and VisTrails.
Translation Details
We wrote a translator to read a foreign provenance data file and translate it to ES3 objects which could then be sent to ES3.
ES3ingest.py is the translator script that was created for the translation step. Here is the command syntax and examples for ES3ingest.py:
Usage: ES3ingest.py -t FOREIGN_FILE_TYPE [-e EXECUTION_LOG_FILE] FILENAME
Examples:
ES3ingest.py -t PASS challenge-D-mod.xml
ES3ingest.py -t VisTrails -e pc_log.xml pc_part3a.xml
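For readers who want to script around the translator, the command line above could be parsed roughly as follows. This is a hypothetical sketch and not the ES3ingest.py source: the destination names, the restriction of -t to the two file types seen in the examples, and the treatment of -e as optional are all assumptions.

```python
import argparse


def build_parser():
    """Build a parser that accepts the command lines shown in the examples."""
    parser = argparse.ArgumentParser(prog="ES3ingest.py")
    # Assumption: only the two foreign file types used on this page are accepted.
    parser.add_argument("-t", dest="file_type", required=True,
                        choices=["PASS", "VisTrails"],
                        help="foreign file type")
    # Assumption: the execution log is optional, since the PASS example omits it.
    parser.add_argument("-e", dest="execution_log", default=None,
                        help="execution log file")
    parser.add_argument("filename", help="foreign provenance data file")
    return parser


args = build_parser().parse_args(
    ["-t", "VisTrails", "-e", "pc_log.xml", "pc_part3a.xml"])
```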
Translating PASS Provenance Data
Provenance data from the PASS system was used for the first portion of the challenge workflow.
The data model used by PASS is very similar to the one used in ES3. The translation process involved converting PASS 'PROC' objects into ES3 'transformation' objects and PASS 'FILE' objects into ES3 'file' objects.
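The type mapping just described can be sketched as a small lookup. The record layout below (dicts with "id", "type" and "name" keys) is a hypothetical simplification; the real PASS export is richer, and this is not the translator's actual code.

```python
# Mapping of PASS object types to ES3 object types, as described above.
PASS_TO_ES3 = {"PROC": "transformation", "FILE": "file"}


def translate_pass_records(records):
    """Translate simplified PASS records into ES3-style object dicts.

    Each record is a hypothetical dict such as
    {"id": "1", "type": "PROC", "name": "align_warp"}.
    Returns (translated objects, records with an unrecognized type).
    """
    es3_objects = []
    skipped = []
    for record in records:
        es3_type = PASS_TO_ES3.get(record["type"])
        if es3_type is None:
            skipped.append(record)  # unknown type: leave for manual inspection
            continue
        es3_objects.append(
            {"id": record["id"], "type": es3_type, "name": record["name"]})
    return es3_objects, skipped
```

Returning the unrecognized records separately, rather than dropping them, makes it easier to see where the foreign data model is not fully understood.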
runCmd is a shell script that runs the translator program for the PASS data.
lineageTrace-part1.graphml is the XML returned by an ES3 lineage query that shows the first portion of the challenge workflow.
lineageTrace-part1.png (http://eil.bren.ucsb.edu/ES3/SecondProvenanceChallenge/PHASE2/Teams/PASS/Results/lineageTrace-part1.png) is a graphical rendering of an ES3 lineage query that shows the PASS provenance data in ES3.
Using ES3 Provenance Data
ES3 provenance data was used for the second portion of the challenge workflow. This data was collected by running the provenance challenge workflow scripts while the probulator was monitoring them. The script run was 'workflow-part2.sh' which executed the command:
$AIR_DIR/bin/softmean atlas.hdr y null resliced1.img resliced2.img resliced3.img resliced4.img
The ES3 transmitter was then run, which sent the information captured by the probulator to ES3.
lineageTrace-part2.graphml is the XML returned by an ES3 lineage query that shows the second portion of the challenge workflow.
lineageTrace-part2 is a graphical rendering of an ES3 lineage query that shows the ES3 provenance data for the second portion of the challenge workflow.
Translating VisTrails Provenance Data
Provenance data from the VisTrails system was used for the third portion of the challenge workflow.
lineageTrace-part3.graphml is the XML returned by an ES3 lineage query that shows the third portion of the challenge workflow.
lineageTrace-part3.png is a graphical rendering of an ES3 lineage query that shows the VisTrails provenance data in ES3.
lineageTrace-part3-Q7.graphml is the XML returned by an ES3 lineage query that shows the third portion of the challenge workflow for the Query 7 variation.
lineageTrace-part3-Q7.png is a graphical rendering of an ES3 lineage query that shows the VisTrails provenance data for the Query 7 variation in ES3.
Combining parts of the workflow
The usual method for ES3 to combine workflows is via the files that they share. One workflow creates output files, then subsequent workflows read them. The md5sum that is calculated and stored when each file is registered is used to determine which files are common to workflows. Lineage queries identify these common files and traverse the workflows that share them.
If the md5sum is not provided, however, as with the provenance data from the Provenance Challenge, then the workflows have to be stitched
together manually by creating "identity" links between common files.
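A minimal sketch of the md5sum-based matching follows. It assumes each file record is a dict with a "uuid" key and an optional "md5" key; those names, and both function names, are hypothetical rather than part of ES3.

```python
import hashlib


def md5sum(path, chunk_size=1 << 20):
    """Compute the md5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def common_files_by_md5(files_a, files_b):
    """Return (uuid_a, uuid_b) pairs for files whose md5sums match.

    Records without an md5sum are never paired.
    """
    by_digest = {}
    for record in files_a:
        if record.get("md5"):
            by_digest.setdefault(record["md5"], []).append(record["uuid"])
    pairs = []
    for record in files_b:
        for uuid_a in by_digest.get(record.get("md5"), []):
            pairs.append((uuid_a, record["uuid"]))
    return pairs
```

When no md5sums are available, the same list of (uuid_a, uuid_b) pairs has to be written by hand, which is the manual "identity" link step described above.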
The files
demonstrate how this was done to stitch together part1 to part2 and part2 to part3 of the workflow.
The file
is a graphical representation of a lineage query showing the combined workflow.
Benchmarks
We used the provenance queries from the first challenge as a benchmark, since these queries are well known to every team and results are
easily compared between the first and second challenge. The provenance queries used in the First Provenance Challenge were used successfully for this challenge without changes.
Provenance Queries
Query 1
- Find UUID for object named "Atlas X Graphic".
- trace lineage backwards from corresponding UUID
- display results
Query 2
- Find UUID for object named "Atlas X Graphic".
- trace lineage backwards from corresponding UUID until object named "softmean" is encountered
- display results
Query 3
- Find UUID for object named "Atlas X Graphic".
- trace lineage backwards 5 links from corresponding UUID
- display results
Discussion
The ES3 Core data model doesn't include a concept of workflow "stages". For this query we simply traced back five links (our interpretation of "Stages 3, 4, and 5" in the challenge workflow) from the "Atlas X Graphic" object. The lineage trace query uses a termination condition that states the trace should end after traversing five links from the starting UUID.
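The backward traces used in Queries 1 to 3 can be sketched as a breadth-first search with optional termination conditions. This is an in-memory illustration in Python, not the query that ES3 actually runs; the function name, its parameters, and the representation of links as (source, destination) pairs are assumptions.

```python
from collections import deque


def trace_backward(links, start, max_links=None, stop_name=None, names=None):
    """Breadth-first backward trace over (source, destination) links.

    max_links stops expansion at that distance from the start (Query 3 style).
    stop_name stops expansion at objects with that name (Query 2 style);
    names maps object ids to names.
    Returns a dict mapping each reached object to its distance in links.
    """
    names = names or {}
    parents = {}
    for source, destination in links:
        parents.setdefault(destination, []).append(source)
    depth = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if max_links is not None and depth[node] >= max_links:
            continue  # termination condition: link limit reached
        if stop_name is not None and node != start and names.get(node) == stop_name:
            continue  # termination condition: named object encountered
        for parent in parents.get(node, []):
            if parent not in depth:
                depth[parent] = depth[node] + 1
                queue.append(parent)
    return depth
```

Calling it with neither condition corresponds to Query 1, with stop_name="softmean" to Query 2, and with max_links=5 to Query 3.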
Query 4
- Find all ES3 transformation objects (i.e. processes) that have the specified name and command line arguments
Discussion
The split score for this query is due to XQuery's lack of support for queries based on day-of-week.
Query 5
We did not implement Query 5, since the ES3 Probulator currently doesn't examine the contents of the objects it monitors. (See Further Comments below)
Query 6
- retrieve all align_warp transformations with arguments -m 12
- trace lineage forward to softmean
- retrieve file objects one lineage step forward from softmean
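The three steps above can be sketched over an in-memory graph. The data layout (a dict of transformations keyed by id, a list of link pairs, and a set of file ids) and the function names are assumptions made for illustration; this is not the ES3 query itself.

```python
def has_args(args, wanted):
    """True if the wanted arguments appear consecutively in args."""
    n = len(wanted)
    return any(args[i:i + n] == wanted for i in range(len(args) - n + 1))


def query6(transformations, links, files):
    """Find files one step forward from softmean, downstream of align_warp -m 12.

    transformations: id -> (name, argument list)
    links: iterable of (source, destination) ids
    files: set of ids that are file objects
    """
    children = {}
    for source, destination in links:
        children.setdefault(source, []).append(destination)

    # Step 1: align_warp transformations run with "-m 12".
    starts = [uid for uid, (name, args) in transformations.items()
              if name == "align_warp" and has_args(args, ["-m", "12"])]

    results = set()
    for start in starts:
        stack, seen = [start], set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            # Step 2: trace forward until a softmean transformation is reached.
            if transformations.get(node, ("", []))[0] == "softmean":
                # Step 3: collect file objects one lineage step forward.
                results.update(c for c in children.get(node, []) if c in files)
                continue
            stack.extend(children.get(node, []))
    return results
```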
Query 7
Pending
Discussion
Our solution to Query 7, while not implemented entirely as an ES3 Core query, is nevertheless responsive to one of the primary classes of user queries that ES3 as a whole was designed to support, namely "what changed?" queries. It's extremely common for scientists developing ad hoc workflows to notice differences in outputs across invocations between which "nothing was changed". Our graph-differencing approach is designed to answer the "what changed?" query as directly (and visually) as possible, while still allowing subsequent drill-down into the details.
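The idea of graph differencing can be illustrated with a simple set comparison. This is only a sketch and not the ES3 implementation, whose details are not given here; it assumes that nodes in the two lineage graphs are labelled by object name rather than UUID, so that the same step in two runs compares equal.

```python
def graph_diff(edges_a, edges_b):
    """Compare two lineage graphs given as iterables of (source, destination) names.

    Returns the nodes and edges present in only one of the two graphs.
    """
    a, b = set(edges_a), set(edges_b)
    nodes_a = {node for edge in a for node in edge}
    nodes_b = {node for edge in b for node in edge}
    return {
        "nodes_removed": nodes_a - nodes_b,
        "nodes_added": nodes_b - nodes_a,
        "edges_removed": a - b,
        "edges_added": b - a,
    }
```

The added and removed sets are what a visual rendering would highlight, and each entry is a starting point for drilling down into the details.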
Further Comments
The manual operation of stitching together the provenance data from different systems to make a complete workflow was cumbersome. ES3 can use md5sums to combine workflows, but computing md5sums is often expensive and often this data is not collected. Another method of combining data should be found if combining dissimilar provenance data proves beneficial in the future.
Conclusions
Translating foreign provenance data and importing it into ES3 was fairly straightforward. However, exported data and documentation alone give an incomplete understanding of another system's data model, which affects the implementation of the translation process.
Interoperability would be facilitated by a common set of terms and possibly a common provenance data format.
--
JamesFrew - 25 June 2007