Provenance Challenge Template
In progress
Participating Team
- Short team name: Karma
- Participant names: Yogesh Simmhan, Beth Plale, Dennis Gannon
- Project URL: http://www.extreme.indiana.edu/karma
- Project Overview: Collecting Provenance in Data-Centric Scientific Workflows. Applied to the Linked Environments for Atmospheric Discovery (LEAD) project
- Provenance-specific Overview:
- Relevant Publications:
- A Framework for Collecting Provenance in Data-Centric Scientific Workflows, Y. L. Simmhan, B. Plale, and D. Gannon, International Conference on Web Services (ICWS), 2006.
- Performance Evaluation of the Karma Provenance Framework for Scientific Workflows, Y. L. Simmhan, B. Plale, D. Gannon, and S. Marru, International Provenance and Annotation Workshop (IPAW) 2006, Lecture Notes in Computer Science, Vol. 4145, p. 222, 2006.
- A Survey of Data Provenance in e-Science, Y. L. Simmhan, B. Plale, and D. Gannon, SIGMOD Record, Vol. 34(3), 2005.
Workflow Representation
Provide here a description of how you have encoded the Challenge workflow.
Provenance Trace
Upload a representation of the information you captured when executing the workflow. Explain the structure (provide pointers to documents describing your schemas etc.)
A sample log of the provenance activities generated by the workflow/services is shown here:
notifications.xml.
The Karma Service API supports two kinds of provenance retrieval: Data Provenance and Process Provenance. It also supports variations of these that can retrieve
RecursiveDataProvenance,
DataUsage, and
WorkflowTrace. Results of these provenance queries on the given workflow are shown here:
- karma.xsd: Karma v2.x schema describing provenance documents
- workflow_trace.xml: Workflow Trace for all invocations in the ProvenanceChallengeBrainWorkflow
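To make the retrieval interface concrete, a minimal client-side sketch in Python is shown below. It is illustrative only: the KarmaClient stub, its parameter and return types, and the name of the optional depth argument are assumptions of this sketch; the real Karma service is invoked through its Web-Service interface and returns XML documents conforming to karma.xsd.

# Sketch only: method names mirror the Karma retrieval operations described
# above; the argument/return types are assumptions (the service itself is a
# Web Service that returns XML documents).
from typing import Optional, Protocol

class KarmaClient(Protocol):
    def getDataProvenance(self, data_product_id: str) -> str: ...
    def getRecursiveDataProvenance(self, data_product_id: str,
                                   depth: Optional[int] = None) -> str: ...
    def getDataUsage(self, data_product_id: str) -> str: ...
    def getProcessProvenance(self, invoker, invokee) -> str: ...
    def getWorkflowTrace(self, workflow_id: str) -> str: ...

def dump_provenance_views(karma: KarmaClient, data_product_id: str,
                          workflow_id: str) -> None:
    # Immediate data provenance of one data product.
    print(karma.getDataProvenance(data_product_id))
    # Full derivation history of the data product (RecursiveDataProvenance).
    print(karma.getRecursiveDataProvenance(data_product_id))
    # Processes that later consumed the data product (DataUsage).
    print(karma.getDataUsage(data_product_id))
    # Trace of all invocations in a workflow run (WorkflowTrace).
    print(karma.getWorkflowTrace(workflow_id))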
These query APIs form the building blocks for constructing the different "canonical" provenance queries in the challenge. Karma does not provide extensive support for annotations at the level of data products. We take the approach that the provenance system is not a generic metadata management system and should focus mainly on storing and retrieving provenance. In the
LEAD project, where Karma is used, queries over generic data product metadata and provenance are achieved by pushing the provenance into the metadata for the data product and letting the
MyLEAD metadata management system answer the "join" queries.
Limited support for queries over annotations is present and has been used to answer the challenge queries that include annotations (except for #9). Some of them have required us to query the provenance service's backend relational database, since support for queries over annotations is not yet available through the service API.
Provenance Queries
For each query, if your system can support your query, provide a description of how you implement the query, what result is returned; otherwise, explain whether the query is in the remit of your system.
Also, make sure you complete the
ProvenanceQueriesMatrix.
Teams | Q1 | Q2 | Q3 | Q4 | Q5 | Q6 | Q7 | Q8 | Q9 |
Karma team | | | | * | * | * | | * | |
* Complete support not available through Karma's Web-Service API. SQL query on backend database required.
1. Find the process that led to Atlas X Graphic / everything that caused Atlas X Graphic to be as it is. This should tell us the new brain images from which the averaged atlas was generated, the warping performed etc.
The
getRecursiveDataProvenance API provided by the Karma provenance service allows the retrieval of the entire data provenance history of a data product. Invoking that method with the data product ID of Atlas X Graphic (in this case,
'lead:uuid:1157946992-atlas-x.gif') returns the complete process that led to its creation. The result of the provenance query is shown in
recursive_data_provenance.xml.
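As a hedged illustration of this step (reusing the hypothetical KarmaClient stub sketched in the Provenance Trace section), the client-side code for query #1 reduces to a single call:

# Sketch only: 'karma' is assumed to be a client stub bound to the Karma
# provenance service endpoint.
ATLAS_X_ID = "lead:uuid:1157946992-atlas-x.gif"

def query1(karma) -> str:
    # Walks the derivation history of Atlas X Graphic all the way back to the
    # original inputs and returns it as a single XML document.
    return karma.getRecursiveDataProvenance(ATLAS_X_ID)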
2. Find the process that led to Atlas X Graphic, excluding everything prior to the averaging of images with softmean.
This query is performed by the client by first invoking the
getDataProvenance method on the Karma provenance service to retrieve the immediate data provenance for Atlas X Graphic. The client then recursively calls
getDataProvenance to move up the provenance tree until the
SoftmeanService is encountered in the data provenance results. The
pseudo-code for the client looks like this:
PrintRecursiveDataProvenanceUntil('lead:uuid:1157946992-atlas-x.gif', 'urn:qname:...:SoftmeanService');
void PrintRecursiveDataProvenanceUntil(DataProductID dataProduct, URI processID)
1. let $dataList := [dataProduct]
2. while ($dataList != empty) do
a. $dataProvenance = karma.getDataProvenance($dataList[0]) // get data provenance for this level
b. Print $dataProvenance; $dataList.delete(0) // print process information & remove data from list
c. if ($dataProvenance.getProducedBy() == processID) break; // found Softmean. Stop.
d. foreach ($inputData in $dataProvenance.getUsingData()) do
// get input data used by this data product. recurse up the tree using iteration
i. $dataList.add($inputData)
3. End
The results of this operation are shown in
query2.txt.
3. Find the Stage 3, 4 and 5 details of the process that led to Atlas X Graphic.
This query is different from #2 in that the provenance levels are relative to the file, instead of being specified explicitly as 'Softmean'. The
getRecursiveDataProvenance API in the Karma provenance service has an optional parameter to specify the depth of recursion. By passing a recursion level of 3 in addition to the data product ID of Atlas X Graphic (in this case,
'lead:uuid:1157946992-atlas-x.gif'), it is possible to retrieve the data provenance for stages 3, 4, and 5. The result of the provenance query is shown in
query3.xml.
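A hedged sketch of this call is shown below; compared with query #1, the only difference is the recursion-depth argument (the keyword name depth is an assumption, since the text above only states that the parameter is optional).

# Sketch only: limits the provenance walk to three levels above Atlas X
# Graphic, i.e. workflow stages 5, 4, and 3.
ATLAS_X_ID = "lead:uuid:1157946992-atlas-x.gif"

def query3(karma) -> str:
    return karma.getRecursiveDataProvenance(ATLAS_X_ID, depth=3)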
4. Find all invocations of procedure align_warp using a twelfth order nonlinear 1365 parameter model (see model menu describing possible values of parameter "-m 12" of align_warp) that ran on a Monday.
The Karma provenance service is primarily intended as a provenance recording and querying system, and has only limited capability for recording generic metadata and annotations. Provenance activities can carry annotations, and relevant activities also contain the messages exchanged between the service and client to perform an operation. These activities are recorded in a relational database, and free-text queries over the annotations are possible using SQL. Direct SQL querying is not currently exposed to the client, but the provenance service has the capability to answer these queries as follows:
- SQL Query to locate align_warp invocations (invoker+invokee pairs) with input parameter "-m 12" that ran on a Monday
SELECT
invokee.workflow_id, invokee.service_id, invokee.workflow_node_id, invokee.workflow_timestep,
invoker.workflow_id, invoker.service_id, invoker.workflow_node_id, invoker.workflow_timestep
FROM
invocation_state_table invocation, entity_table invokee, entity_table invoker, notification_table notifications
WHERE
invokee.entity_id = invocation.invokee_id AND
invoker.entity_id = invocation.invoker_id AND
notifications.source_id = invocation.invokee_id AND
notifications.notification_type = 'ServiceInvoked' AND
invokee.service_id = 'urn:qname:http://www.extreme.indiana.edu/karma/challenge06:AlignWarpService' AND
notifications.notification_xml LIKE '%<ModelMenuNumber>12</ModelMenuNumber>%' AND
DAYOFWEEK(invocation.request_receive_time) = 2; -- 1=Sunday, 2=Monday, ...
In our example (assuming the workflow was run on a Monday rather than the Sunday on which it was actually run), this query returns:
Entity | workflow_id | service_id | workflow_node_id | workflow_timestep |
Invokee 1 | 'tag:gpel.leadproject.org,2006:69B/ProvenanceChallengeBrainWorkflow17/instance1' | 'urn:qname:http://www.extreme.indiana.edu/karma/challenge06:AlignWarpService' | 'AlignWarpService' | 6 |
Invokee 2 | 'tag:gpel.leadproject.org,2006:69B/ProvenanceChallengeBrainWorkflow17/instance1' | 'urn:qname:http://www.extreme.indiana.edu/karma/challenge06:AlignWarpService' | 'AlignWarpService_2' | 8 |
Invokee 3 | 'tag:gpel.leadproject.org,2006:69B/ProvenanceChallengeBrainWorkflow17/instance1' | 'urn:qname:http://www.extreme.indiana.edu/karma/challenge06:AlignWarpService' | 'AlignWarpService_3' | 10 |
Invokee 4 | 'tag:gpel.leadproject.org,2006:69B/ProvenanceChallengeBrainWorkflow17/instance1' | 'urn:qname:http://www.extreme.indiana.edu/karma/challenge06:AlignWarpService' | 'AlignWarpService_4' | 12 |
Invoker | - | 'tag:gpel.leadproject.org,2006:69B/ProvenanceChallengeBrainWorkflow17/instance1' | - | - |
- Using the invoker and invokee information from the above query, the client can use the getProcessProvenance API to query for the description of the matching align_warp services (a sketch of this step is given below). The result of this is shown in query4.txt.
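A minimal sketch of this step follows; it assumes the SQL result rows carry the invokee columns first and the invoker columns second (as in the SELECT list above), and reuses the hypothetical KarmaClient stub.

# Sketch only: each row holds (invokee.workflow_id, invokee.service_id,
# invokee.workflow_node_id, invokee.workflow_timestep, invoker.workflow_id,
# invoker.service_id, invoker.workflow_node_id, invoker.workflow_timestep).
def query4(karma, sql_rows) -> None:
    for row in sql_rows:
        invokee, invoker = row[0:4], row[4:8]
        # Retrieve and print the description (inputs, outputs, status,
        # timestamps) of this matching align_warp invocation.
        print(karma.getProcessProvenance(invoker, invokee))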
5. Find all Atlas Graphic images outputted from workflows where at least one of the input Anatomy Headers had an entry global maximum=4095. The contents of a header file can be extracted as text using the scanheader AIR utility.
In the workflow we execute, the command-line applications are wrapped by shell scripts that can perform pre- and post-processing. We incorporate a call to the scanheader utility within the wrapper for align_warp and include scanheader's output in the
ServiceInvoked activity's annotation (a sketch of such a wrapper is given below). With the header text captured in the annotation, the query becomes similar to the previous case:
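The sketch below shows, in Python for brevity, the wrapper logic just described (the actual wrapper is a shell script); the publish_service_invoked helper standing in for the LEAD notification call, and the annotation layout it prints, are assumptions of this sketch.

# Sketch only: illustrates how the align_warp wrapper folds scanheader's text
# output into the ServiceInvoked activity before running the application.
import subprocess

def publish_service_invoked(annotation: str) -> None:
    # Stand-in for publishing the ServiceInvoked provenance activity with the
    # given annotation attached.
    print("<ServiceInvoked><annotation>%s</annotation></ServiceInvoked>" % annotation)

def run_align_warp(header_file: str, align_warp_args: list) -> None:
    # Pre-processing: extract the anatomy header as text with the AIR utility.
    header_text = subprocess.run(["scanheader", header_file],
                                 capture_output=True, text=True,
                                 check=True).stdout
    # Record the header contents in the activity's annotation so that the SQL
    # query below can match on the global maximum entry.
    publish_service_invoked(annotation=header_text)
    # Invoke the wrapped command-line application itself.
    subprocess.run(["align_warp"] + align_warp_args, check=True)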
- SQL Query to locate align_warp invocations (invoker+invokee pairs) that have annotation of "global_maximum=4095"
SELECT
invokee.workflow_id, invokee.service_id, invokee.workflow_node_id, invokee.workflow_timestep,
invoker.workflow_id, invoker.service_id, invoker.workflow_node_id, invoker.workflow_timestep
FROM
entity_table invokee, entity_table invoker, notification_table notifications, invocation_state_table invocation
WHERE
invokee.entity_id = invocation.invokee_id AND
invoker.entity_id = invocation.invoker_id AND
notifications.source_id = invocation.invokee_id AND
notifications.notification_type = 'ServiceInvoked' AND
invokee.service_id = 'urn:qname:http://www.extreme.indiana.edu/karma/challenge06:AlignWarpService' AND
notifications.notification_xml LIKE '%global_maximum=4095%'
In our example, this query returns:
Entity | workflow_id | service_id | workflow_node_id | workflow_timestep |
Invokee_1 | 'tag:gpel.leadproject.org,2006:69B/ProvenanceChallengeBrainWorkflow17/instance1' | 'urn:qname:http://www.extreme.indiana.edu/karma/challenge06:AlignWarpService' | 'AlignWarpService' | 6 |
Invokee_2 | 'tag:gpel.leadproject.org,2006:69B/ProvenanceChallengeBrainWorkflow17/instance1' | 'urn:qname:http://www.extreme.indiana.edu/karma/challenge06:AlignWarpService' | 'AlignWarpService_2' | 8 |
Invokee_3 | 'tag:gpel.leadproject.org,2006:69B/ProvenanceChallengeBrainWorkflow17/instance1' | 'urn:qname:http://www.extreme.indiana.edu/karma/challenge06:AlignWarpService' | 'AlignWarpService_3' | 10 |
Invokee_4 | 'tag:gpel.leadproject.org,2006:69B/ProvenanceChallengeBrainWorkflow17/instance1' | 'urn:qname:http://www.extreme.indiana.edu/karma/challenge06:AlignWarpService' | 'AlignWarpService_4' | 12 |
Invoker_0 | - | 'tag:gpel.leadproject.org,2006:69B/ProvenanceChallengeBrainWorkflow17/instance1' | - | - |
- Using the invoker and invokee information from the above query, the client can start a recursive descent down the process provenance tree to look for output data files that are images generated by the convert service.
PrintRecursiveDataUsageFor(Invoker_0, Invokee_1, 'urn:qname:...:ConvertService');
void PrintRecursiveDataUsageFor(EntityID invoker, EntityID invokee, URI processID)
// get initial process's provenance
1. let $processProv := karma.getProcessProvenance(invoker, invokee)
2. let $processList := [$processProv], $outputDataList := []
// start recursing down the data usage tree iteratively
3. while ($processList != empty) do
a. foreach ($processProv in $processList) do
// test if any of the processes in the current list is 'ConvertService'. If so, print its output image files.
i. if ($processProv.getInvokee().getServiceID() == processID) Print $processProv.getProducingData()
// add the data products that were produced to the list of outputs to recurse into
ii. Add all $processProv.getProducingData() to $outputDataList
// we're done with these processes
b. $processList := []
c. foreach ($outputData in $outputDataList) do
// get the data usage list for the output data produced
i. let $dataUsage := karma.getDataUsage($outputData)
// get the process provenance for each process that used the output data and add them to the process list
ii. foreach ($usedByProcess in $dataUsage.getUsageList())
- let $processProv := karma.getProcessProvenance($usedByProcess.invoker, $usedByProcess.invokee)
- Add $processProv to $processList
// we're done with these data
d. $outputDataList := []
4. End
The results of this operation are shown in
query5.txt.
6. Find all output averaged images of softmean (average) procedures, where the warped images taken as input were align_warped using a twelfth order nonlinear 1365 parameter model, i.e. "where softmean was preceded in the workflow, directly or indirectly, by an align_warp procedure with argument -m 12."
This is a variation of queries #4 and #5. The SQL query used to retrieve the align_warp services that had a model menu number value of 12 is the same as the query in #4, minus the DAYOFWEEK predicate. Similarly, the client's recursive procedure to locate the outputs of all
SoftmeanServices that were preceded by these align_warps is the same as the recursive procedure outlined in query #5, with ConvertService replaced by SoftmeanService. Both are reproduced below.
- SQL Query to locate align_warp invocations with model menu number 12 (same as #4 without the DAYOFWEEK predicate)
SELECT
invokee.workflow_id, invokee.service_id, invokee.workflow_node_id, invokee.workflow_timestep,
invoker.workflow_id, invoker.service_id, invoker.workflow_node_id, invoker.workflow_timestep
FROM
invocation_state_table invocation, entity_table invokee, entity_table invoker, notification_table notifications
WHERE
invokee.entity_id = invocation.invokee_id AND
invoker.entity_id = invocation.invoker_id AND
notifications.source_id = invocation.invokee_id AND
notifications.notification_type = 'ServiceInvoked' AND
invokee.service_id = 'urn:qname:http://www.extreme.indiana.edu/karma/challenge06:AlignWarpService' AND
notifications.notification_xml LIKE '%<ModelMenuNumber>12</ModelMenuNumber>%';
- Client-side recursive procedure to locate SoftmeanService outputs
PrintRecursiveDataUsageFor(Invoker_0, Invokee_1, 'urn:qname:...:SoftmeanService');
(See Query #5 for definition)
The results of this operation are shown in
query6.txt.
7. A user has run the workflow twice, in the second instance replacing each procedure (convert) in the final stage with two procedures: pgmtoppm, then pnmtojpeg. Find the differences between the two workflow runs. The exact level of detail in the difference that is detected by a system is up to each participant.
The
getWorkflowTrace API of the Karma service returns the complete workflow trace for a workflow as an XML document. Given the workflow traces for two different workflow runs, it is possible to do a semantic "diff" of the two documents to find the differences in the processes that were invoked and the data products used and produced. The pseudo-code for printing out the differences between two workflow traces is given below:
void PrintWorkflowTraceDiff(WorkflowTrace trace1, WorkflowTrace trace2)
// Workflow trace is an extension of the process provenance document
1. let $processProv1 := trace1 as ProcessProvenance
2. let $processProv2 := trace2 as ProcessProvenance
3. PrintProcessProvenanceDiff($processProv1, $processProv2)
// Each step in the workflow trace is a process provenance document
4. foreach ($processProv1, $processProv2 in trace1.getTraceSteps(), trace2.getTraceSteps())
a. PrintProcessProvenanceDiff($processProv1, $processProv2)
5. End
void PrintProcessProvenanceDiff(ProcessProvenance processProv1, ProcessProvenance processProv2)
1. Print "Diff of Processes: ", processProv1.getInvokee(), processProv2.getInvokee()
2. if (processProv1.getInvokee() != processProv2.getInvokee())
a. Print "Invokees Differ: ", processProv1.getInvokee(), processProv2.getInvokee()
3. if (processProv1.getInvoker() != processProv2.getInvoker())
a. Print "Invokers Differ: ", processProv1.getInvoker(), processProv2.getInvoker()
4. if (processProv1.getStatus() != processProv2.getStatus())
a. Print "Process Completion Status Differ: ", processProv1.getStatus(), processProv2.getInvoker()
5. if (processProv1.getRequestReceiveTime() != processProv2.getRequestReceiveTime())
a. Print "Invocation Times Differ: ", processProv1.getRequestReceiveTime(), processProv2.getRequestReceiveTime()
6. foreach ($dataProd1, $dataProd2 in processProv1.getUsingData(), processProv2.getUsingData())
a. PrintDataProductDiff($dataProd1, $dataProd2)
7. foreach ($dataProd1, $dataProd2 in processProv1.getProducingData(), processProv2.getProducingData())
a. PrintDataProductDiff($dataProd1, $dataProd2)
8. End
void PrintDataProductDiff(DataProduct dataProd1, DataProduct dataProd2)
1. if (dataProd1.getDataProductID() != dataProd2.getDataProductID()) // trivial. IDs always differ.
a. Print "Produced Data IDs Differ: ", dataProd1.getDataProductID(), dataProd2.getDataProductID()
2. if (dataProd1.getLocation() != dataProd2.getLocation())
a. Print "Produced Data Locations Differ: ", dataProd1.getLocation(), dataProd2.getLocation()
3. if (dataProd1.getTimestamp() != dataProd2.getTimestamp())
a. Print "Produced Data Timestamp Differ: ", dataProd1.getTimestamp(), dataProd2.getTimestamp()
4. End
The second workflow was not run and hence the query results for this are not available.
8. A user has annotated some anatomy images with a key-value pair center=UChicago. Find the outputs of align_warp where the inputs are annotated with center=UChicago.
As noted earlier, the Karma service does not support detailed annotations at the file level, deferring to an external metadata management system such as
MyLEAD. However, it allows generic annotations to be submitted as part of the provenance activities, which can then be queried. We use this facility to add metadata about the input anatomy images to the provenance activity and query it. This is again similar to queries #4, #5 and #6, in that a SQL query retrieves the invocations and we use the
getProcessProvenance API of Karma to retrieve the output data products.
- SQL Query to locate align_warp invocations (invoker+invokee pairs) whose input data products have the annotation "center=UChicago"
SELECT
invokee.workflow_id, invokee.service_id, invokee.workflow_node_id, invokee.workflow_timestep,
invoker.workflow_id, invoker.service_id, invoker.workflow_node_id, invoker.workflow_timestep
FROM
invocation_state_table invocation, entity_table invokee, entity_table invoker, notification_table notifications
WHERE
invokee.entity_id = invocation.invokee_id AND
invoker.entity_id = invocation.invoker_id AND
notifications.source_id = invocation.invokee_id AND
notifications.notification_type = 'ServiceInvoked' AND
invokee.service_id = 'urn:qname:http://www.extreme.indiana.edu/karma/challenge06:AlignWarpService' AND
notifications.notification_xml LIKE '%<Center>UChicago</Center>%';
- We then call getProcessProvenance on the invocations returned by the above query and print the produced data product elements (a sketch is given below). If all four align_warp services match, the results are shown in query8.txt.
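A hedged sketch of this step is given below. It reuses the hypothetical KarmaClient stub and assumes the produced data products appear in the returned XML under elements whose local name is producedData; the actual element names are defined by karma.xsd.

# Sketch only: extracts produced-data elements from the process provenance
# XML by local element name ('producedData' is an assumption; consult
# karma.xsd for the real names).
import xml.etree.ElementTree as ET

def query8(karma, sql_rows) -> None:
    for row in sql_rows:
        invokee, invoker = row[0:4], row[4:8]
        doc = ET.fromstring(karma.getProcessProvenance(invoker, invokee))
        for elem in doc.iter():
            # elem.tag is '{namespace}localName' for namespaced elements.
            if elem.tag.split('}')[-1] == 'producedData':
                print(ET.tostring(elem, encoding='unicode'))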
9. A user has annotated some atlas graphics with key-value pair where the key is studyModality. Find all the graphical atlas sets that have metadata annotation studyModality with values speech, visual or audio, and return all other annotations to these files.
The Karma service does not support complex queries such as these on data product annotations. One way to perform this query would have been to retrieve the annotations for atlas graphics with key
studyModality having the value
speech, visual, or
audio, using a query similar to query #8, and then to filter the keys at the client end. However, we do not expect to answer such queries through the provenance system, and these will not be part of the provenance service API.
Suggested Workflow Variants
Suggest variants of the workflow that can exhibit capabilities that your system supports.
- Workflows with loops.
- Workflows whose structure changes dynamically (or, as a simpler case, workflows with conditional branches).
- Hierarchical composition of workflows (workflows invoking other workflows).
Suggested Queries
Suggest significant queries that your system can support and are not in the proposed list of queries, and how you have implemented/would implement them. These queries may be with regards to a variant of the workflow suggested above.
- Find all [workflows | processes] with a particular execution status [completed | failed | waiting for input]
- Show the client view and service view of the provenance and check for differences
Categorisation of queries
According to your provenance approach, you may be able to provide a categorisation of queries. Can you elaborate on the categorisation and its rationale.
- Provenance Structure
- Annotation
Live systems
If your system can be accessed live (through portal, web page, web service, or other), provide relevant information here.
Further Comments
Provide here further comments.
Conclusions
Provide here your conclusions on the challenge, and issues that you like to see discussed at a face to face meeting.
--
YogeshSimmhan - 13 Sep 2006