Provenance Challenge Template
In progress
Participating Team
- Short team name: Karma
- Participant names: Yogesh Simmhan, Beth Plale, Dennis Gannon
- Project URL: http://www.extreme.indiana.edu/karma
- Project Overview: Collecting Provenance in Data-Centric Scientific Workflows. Applied to the Linked Environments for Atmospheric Discovery (LEAD) project
- Provenance-specific Overview:
- Relevant Publications:
- A Framework for Collecting Provenance in Data-Centric Scientific Workflows, Y. L. Simmhan, B. Plale, and D. Gannon, International Conference on Web Services (ICWS), 2006.
- Performance Evaluation of the Karma Provenance Framework for Scientific Workflows, Y. L. Simmhan, B. Plale, D. Gannon, and S. Marru, International Provenance and Annotation Workshop (IPAW) 2006, Lecture Notes in Computer Science, Vol. 4145, p. 222, 2006.
- A Survey of Data Provenance in e-Science, Y. L. Simmhan, B. Plale, and D. Gannon, SIGMOD Record, Vol. 34(3), 2005.
Workflow Representation
Provide here a description of how you have encoded the Challenge workflow.
Provenance Trace
Upload a representation of the information you captured when executing the workflow. Explain the structure (provide pointers to documents describing your schemas etc.)
A sample log of the provenance activities generated by the workflow/services is shown here:
notifications.xml.
The Karma Service API supports two kinds of provenance retrieval: Data Provenance and Process Provenance. It also supports variations of these that can retrieve
RecursiveDataProvenance,
DataUsage, and
WorkflowTrace. Results of these provenance queries on the given workflow are shown here:
- karma.xsd: Karma v2.x schema describing provenance documents
- workflow_trace.xml: Workflow Trace for all invocations in the ProvenanceChallengeBrainWorkflow
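To make the retrieval interface concrete, a minimal client-side sketch in Python is shown below. It is illustrative only: the KarmaClient stub, its parameter and return types, and the name of the optional depth argument are assumptions of this sketch; the real Karma service is invoked through its Web-Service interface and returns XML documents conforming to karma.xsd.

# Sketch only: method names mirror the Karma retrieval operations described
# above; the argument/return types are assumptions (the service itself is a
# Web Service that returns XML documents).
from typing import Optional, Protocol

class KarmaClient(Protocol):
    def getDataProvenance(self, data_product_id: str) -> str: ...
    def getRecursiveDataProvenance(self, data_product_id: str,
                                   depth: Optional[int] = None) -> str: ...
    def getDataUsage(self, data_product_id: str) -> str: ...
    def getProcessProvenance(self, invoker, invokee) -> str: ...
    def getWorkflowTrace(self, workflow_id: str) -> str: ...

def dump_provenance_views(karma: KarmaClient, data_product_id: str,
                          workflow_id: str) -> None:
    # Immediate data provenance of one data product.
    print(karma.getDataProvenance(data_product_id))
    # Full derivation history of the data product (RecursiveDataProvenance).
    print(karma.getRecursiveDataProvenance(data_product_id))
    # Processes that later consumed the data product (DataUsage).
    print(karma.getDataUsage(data_product_id))
    # Trace of all invocations in a workflow run (WorkflowTrace).
    print(karma.getWorkflowTrace(workflow_id))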
These query APIs form the building blocks for constructing the different "canonical" provenance queries in the challenge. Karma does not provide extensive support for annotations at the level of data products. We take the approach that the provenance system is not a generic metadata management system and should focus mainly on storing and retrieving provenance. In the
LEAD project, where Karma is used, queries over generic data product metadata and provenance are achieved by pushing the provenance into the metadata for the data product and letting the
MyLEAD metadata management system answer the "join" queries.
Limited support for queries over annotations is present and has been used to answer the challenge queries that include annotations (except for #9). Some of them have required us to query the provenance service's backend relational database, since support for queries over annotations is not yet available through the service API.
Provenance Queries
For each query, if your system can support your query, provide a description of how you implement the query, what result is returned; otherwise, explain whether the query is in the remit of your system.
Also, make sure you complete the
ProvenanceQueriesMatrix.
Teams | Q1 | Q2 | Q3 | Q4 | Q5 | Q6 | Q7 | Q8 | Q9 |
Karma team | | | | * | * | * | | * | |
* Complete support not available through Karma's Web-Service API. SQL query on backend database required.
1. Find the process that led to Atlas X Graphic / everything that caused Atlas X Graphic to be as it is. This should tell us the new brain images from which the averaged atlas was generated, the warping performed etc.
The
getRecursiveDataProvenance API provided by the Karma provenance service allows the retrieval of the entire data provenance history of a data product. Invoking that method with the data product ID of Atlas X Graphic (in this case,
'lead:uuid:1157946992-atlas-x.gif') returns the complete process that led to its creation. The result of the provenance query is shown in
recursive_data_provenance.xml.
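As a hedged illustration of this step (reusing the hypothetical KarmaClient stub sketched in the Provenance Trace section), the client-side code for query #1 reduces to a single call:

# Sketch only: 'karma' is assumed to be a client stub bound to the Karma
# provenance service endpoint.
ATLAS_X_ID = "lead:uuid:1157946992-atlas-x.gif"

def query1(karma) -> str:
    # Walks the derivation history of Atlas X Graphic all the way back to the
    # original inputs and returns it as a single XML document.
    return karma.getRecursiveDataProvenance(ATLAS_X_ID)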
2. Find the process that led to Atlas X Graphic, excluding everything prior to the averaging of images with softmean.
This query is performed by the client by first invoking the
getDataProvenance method on the Karma provenance service to retrieve the immediate data provenance for Atlas X Graphic. The client then recursively calls
getDataProvenance to move up the provenance tree until the
SoftmeanService is encountered in the data provenance results. The
pseudo-code for the client looks like this:
PrintRecursiveDataProvenanceUntil('lead:uuid:1157946992-atlas-x.gif', 'urn:qname:...:SoftmeanService');
void PrintRecursiveDataProvenanceUntil(DataProductID dataProduct, URI processID)
1. let $dataList := [dataProduct]
2. while ($dataList != empty) do
a. $dataProvenance = karma.getDataProvenance($dataList[0]) // get data provenance for this level
b. Print $dataProvenance; $dataList.delete(0) // print process information & remove data from list
c. if ($dataProvenance.getProducedBy() == processID) break; // found Softmean. Stop.
d. foreach ($inputData in $dataProvenance.getUsingData()) do
// get input data used by this data product. recurse up the tree using iteration
i. $dataList.add($inputData)
3. End
The results of this operation are shown in
query2.txt.
3. Find the Stage 3, 4 and 5 details of the process that led to Atlas X Graphic.
This query is different from #2 in that the provenance levels are relative to the file, instead of being specified explicitly as 'Softmean'. The
getRecursiveDataProvenance API in the Karma provenance service has an optional parameter to specify the depth of recursion. By passing a recursion level of 3 in addition to the data product ID of Atlas X Graphic (in this case,
'lead:uuid:1157946992-atlas-x.gif'), it is possible to retrieve the data provenance for stages 3, 4, and 5. The result of the provenance query is shown in
query3.xml.
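A hedged sketch of this call is shown below; compared with query #1, the only difference is the recursion-depth argument (the keyword name depth is an assumption, since the text above only states that the parameter is optional).

# Sketch only: limits the provenance walk to three levels above Atlas X
# Graphic, i.e. workflow stages 5, 4, and 3.
ATLAS_X_ID = "lead:uuid:1157946992-atlas-x.gif"

def query3(karma) -> str:
    return karma.getRecursiveDataProvenance(ATLAS_X_ID, depth=3)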
4. Find all invocations of procedure align_warp using a twelfth order nonlinear 1365 parameter model (see model menu describing possible values of parameter "-m 12" of align_warp) that ran on a Monday.
The Karma provenance service is primarily intended as a provenance recording and querying system, and has only limited capability for recording generic metadata and annotations. Provenance activities can carry annotations, and relevant activities also contain the messages exchanged between the service and client to perform an operation. These activities are recorded in a relational database, and free-text queries over the annotations are possible using SQL. Direct SQL querying is not currently exposed to the client, but the provenance service has the capability to answer these queries as follows:
- SQL Query to locate align_warp invocations (invoker+invokee pairs) with input parameter "-m 12" that ran on a Monday
SELECT
invokee.workflow_id, invokee.service_id, invokee.workflow_node_id, invokee.workflow_timestep,
invoker.workflow_id, invoker.service_id, invoker.workflow_node_id, invoker.workflow_timestep
FROM
invocation_state_table invocation, entity_table invokee, entity_table invoker, notification_table notifications
WHERE
invokee.entity_id = invocation.invokee_id AND
invoker.entity_id = invocation.invoker_id AND
notifications.source_id = invocation.invokee_id AND
notifications.notification_type = 'ServiceInvoked' AND
invokee.service_id = 'urn:qname:http://www.extreme.indiana.edu/karma/challenge06:AlignWarpService' AND
notifications.notification_xml LIKE '%<ModelMenuNumber>12</ModelMenuNumber>%' AND
DAYOFWEEK(invocation.request_receive_time) = 2; -- 1=Sunday, 2=Monday, ...
In our example (assuming the workflow was run on a Monday rather than the Sunday on which it was actually run), this query returns:
Entity | workflow_id | service_id | workflow_node_id | workflow_timestep |
Invokee 1 | 'tag:gpel.leadproject.org,2006:69B/ProvenanceChallengeBrainWorkflow17/instance1' | 'urn:qname:http://www.extreme.indiana.edu/karma/challenge06:AlignWarpService' | 'AlignWarpService' | 6 |
Invokee 2 | 'tag:gpel.leadproject.org,2006:69B/ProvenanceChallengeBrainWorkflow17/instance1' | 'urn:qname:http://www.extreme.indiana.edu/karma/challenge06:AlignWarpService' | 'AlignWarpService_2' | 8 |
Invokee 3 | 'tag:gpel.leadproject.org,2006:69B/ProvenanceChallengeBrainWorkflow17/instance1' | 'urn:qname:http://www.extreme.indiana.edu/karma/challenge06:AlignWarpService' | 'AlignWarpService_3' | 10 |
Invokee 4 | 'tag:gpel.leadproject.org,2006:69B/ProvenanceChallengeBrainWorkflow17/instance1' | 'urn:qname:http://www.extreme.indiana.edu/karma/challenge06:AlignWarpService' | 'AlignWarpService_4' | 12 |
Invoker | - | 'tag:gpel.leadproject.org,2006:69B/ProvenanceChallengeBrainWorkflow17/instance1' | - | - |
- Using the invoker and invokee information from the above query, the client can use the getProcessProvenance API to query for the description of the matching align_warp services (a sketch of this step is given below). The result of this is shown in query4.txt.
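A minimal sketch of this step follows; it assumes the SQL result rows carry the invokee columns first and the invoker columns second (as in the SELECT list above), and reuses the hypothetical KarmaClient stub.

# Sketch only: each row holds (invokee.workflow_id, invokee.service_id,
# invokee.workflow_node_id, invokee.workflow_timestep, invoker.workflow_id,
# invoker.service_id, invoker.workflow_node_id, invoker.workflow_timestep).
def query4(karma, sql_rows) -> None:
    for row in sql_rows:
        invokee, invoker = row[0:4], row[4:8]
        # Retrieve and print the description (inputs, outputs, status,
        # timestamps) of this matching align_warp invocation.
        print(karma.getProcessProvenance(invoker, invokee))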
5. Find all Atlas Graphic images outputted from workflows where at least one of the input Anatomy Headers had an entry global maximum=4095. The contents of a header file can be extracted as text using the scanheader AIR utility.
In the workflow we execute, the command-line applications are wrapped by shell scripts that can perform pre- and post-processing. We incorporate a call to the scanheader utility within the wrapper for align_warp and include scanheader's output in the
ServiceInvoked activity's annotation (a sketch of such a wrapper is given below). With the header text captured in the annotation, the query becomes similar to the previous case:
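The sketch below shows, in Python for brevity, the wrapper logic just described (the actual wrapper is a shell script); the publish_service_invoked helper standing in for the LEAD notification call, and the annotation layout it prints, are assumptions of this sketch.

# Sketch only: illustrates how the align_warp wrapper folds scanheader's text
# output into the ServiceInvoked activity before running the application.
import subprocess

def publish_service_invoked(annotation: str) -> None:
    # Stand-in for publishing the ServiceInvoked provenance activity with the
    # given annotation attached.
    print("<ServiceInvoked><annotation>%s</annotation></ServiceInvoked>" % annotation)

def run_align_warp(header_file: str, align_warp_args: list) -> None:
    # Pre-processing: extract the anatomy header as text with the AIR utility.
    header_text = subprocess.run(["scanheader", header_file],
                                 capture_output=True, text=True,
                                 check=True).stdout
    # Record the header contents in the activity's annotation so that the SQL
    # query below can match on the global maximum entry.
    publish_service_invoked(annotation=header_text)
    # Invoke the wrapped command-line application itself.
    subprocess.run(["align_warp"] + align_warp_args, check=True)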
- SQL Query to locate align_warp invocations (invoker+invokee pairs) that have annotation of "global_maximum=4095"
SELECT
invokee.workflow_id, invokee.service_id, invokee.workflow_node_id, invokee.workflow_timestep,
invoker.workflow_id, invoker.service_id, invoker.workflow_node_id, invoker.workflow_timestep
FROM
entity_table invokee, entity_table invoker, notification_table notifications, invocation_state_table invocation
WHERE
invokee.entity_id = invocation.invokee_id AND
invoker.entity_id = invocation.invoker_id AND
notifications.source_id = invocation.invokee_id AND
notifications.notification_type = 'ServiceInvoked' AND
invokee.service_id = 'urn:qname:http://www.extreme.indiana.edu/karma/challenge06:AlignWarpService' AND
notifications.notification_xml LIKE '%global_maximum=4095%'
In our example, this query returns:
Entity | workflow_id | service_id | workflow_node_id | workflow_timestep |
Invokee_1 | 'tag:gpel.leadproject.org,2006:69B/ProvenanceChallengeBrainWorkflow17/instance1' | 'urn:qname:http://www.extreme.indiana.edu/karma/challenge06:AlignWarpService' | 'AlignWarpService' | 6 |
Invokee_2 | 'tag:gpel.leadproject.org,2006:69B/ProvenanceChallengeBrainWorkflow17/instance1' | 'urn:qname:http://www.extreme.indiana.edu/karma/challenge06:AlignWarpService' | 'AlignWarpService_2' | 8 |
Invokee_3 | 'tag:gpel.leadproject.org,2006:69B/ProvenanceChallengeBrainWorkflow17/instance1' | 'urn:qname:http://www.extreme.indiana.edu/karma/challenge06:AlignWarpService' | 'AlignWarpService_3' | 10 |
Invokee_4 | 'tag:gpel.leadproject.org,2006:69B/ProvenanceChallengeBrainWorkflow17/instance1' | 'urn:qname:http://www.extreme.indiana.edu/karma/challenge06:AlignWarpService' | 'AlignWarpService_4' | 12 |
Invoker_0 | - | 'tag:gpel.leadproject.org,2006:69B/ProvenanceChallengeBrainWorkflow17/instance1' | - | - |
- Using the invoker and invokee information from the above query, the client can start a recursive descent down the process provenance tree to look for output data files that are images generated by the convert service.
PrintRecursiveDataUsageFor(Invoker_0, Invokee_1, 'urn:qname:...:ConvertService');
void PrintRecursiveDataUsageFor(EntityID invoker, EntityID invokee, URI processID)
// get initial process's provenance
1. let $processProv := karma.getProcessProvenance(invoker, invokee)
2. let $processList := [$processProv], $outputDataList := []
// start recursing down the data usage tree iteratively
3. while ($processList != empty) do
a. foreach ($processProv in $processList) do
// test if any of the processes in the current list is 'ConvertService'. If so, print its output image files.
i. if ($processProv.getInvokee().getServiceID() == processID) Print $processProv.getProducingData()
// add the data products that were produced to the list of outputs to recurse into
ii. Add all $processProv.getProducingData() to $outputDataList
// we're done with these processes
b. $processList := []
c. foreach ($outputData in $outputDataList) do
// get the data usage list for the output data produced
i. let $dataUsage := karma.getDataUsage($outputData)
// get the process provenance for each process that used the output data and add them to the process list
ii. foreach ($usedByProcess in $dataUsage.getUsageList())
- let $processProv := karma.getProcessProvenance($usedByProcess.invoker, $usedByProcess.invokee)
- Add $processProv to $processList
// we're done with these data
d. $outputDataList := []
4. End
The results of this operation are shown in
query5.txt.
6. Find all output averaged images of softmean (average) procedures, where the warped images taken as input were align_warped using a twelfth order nonlinear 1365 parameter model, i.e. "where softmean was preceded in the workflow, directly or indirectly, by an align_warp procedure with argument -m 12."
This is a variation of queries #4 and #5. The SQL query used to retrieve the align_warp services that had a model menu number value of 12 is the same as the query in #4, minus the DAYOFWEEK predicate. Similarly, the client's recursive procedure to locate the outputs of all
SoftmeanServices that were preceded by these align_warps is the same as the recursive procedure outlined in query #5, with ConvertService replaced by SoftmeanService. Both are reproduced below.
- SQL Query to locate align_warp invocations with model menu number 12 (same as #4 without the DAYOFWEEK predicate)
SELECT
invokee.workflow_id, invokee.service_id, invokee.workflow_node_id, invokee.workflow_timestep,
invoker.workflow_id, invoker.service_id, invoker.workflow_node_id, invoker.workflow_timestep
FROM
invocation_state_table invocation, entity_table invokee, entity_table invoker, notification_table notifications
WHERE
invokee.entity_id = invocation.invokee_id AND
invoker.entity_id = invocation.invoker_id AND
notifications.source_id = invocation.invokee_id AND
notifications.notification_type = 'ServiceInvoked' AND
invokee.service_id = 'urn:qname:http://www.extreme.indiana.edu/karma/challenge06:AlignWarpService' AND
notifications.notification_xml LIKE '%<ModelMenuNumber>12</ModelMenuNumber>%';
- Client-side recursive procedure to locate SoftmeanService outputs
PrintRecursiveDataUsageFor(Invoker_0, Invokee_1, 'urn:qname:...:SoftmeanService');
(See Query #5 for definition)
The results of this operation are shown in
query6.txt.
7. A user has run the workflow twice, in the second instance replacing each procedure (convert) in the final stage with two procedures: pgmtoppm, then pnmtojpeg. Find the differences between the two workflow runs. The exact level of detail in the difference that is detected by a system is up to each participant.
The
getWorkflowTrace API of the Karma service returns the complete workflow trace for a workflow as an XML document. Given the workflow traces for two different workflow runs, it is possible to do a semantic "diff" of the two documents to find the differences in the processes that were invoked and the data products used and produced. The pseudo-code for printing out the differences between two workflow traces is given below:
void PrintWorkflowTraceDiff(WorkflowTrace trace1, WorkflowTrace trace2)
// Workflow trace is an extension of the process provenance document
1. let $processProv1 := trace1 as ProcessProvenance
2. let $processProv2 := trace2 as ProcessProvenance
3. PrintProcessProvenanceDiff($processProv1, $processProv2)
// Each step in the workflow trace is a process provenance document
4. foreach ($processProv1, $processProv2 in trace1.getTraceSteps(), trace2.getTraceSteps())
a. PrintProcessProvenanceDiff($processProv1, $processProv2)
5. End
void PrintProcessProvenanceDiff(ProcessProvenance processProv1, ProcessProvenance processProv2)
1. Print "Diff of Processes: ", processProv1.getInvokee(), processProv2.getInvokee()
2. if (processProv1.getInvokee() != processProv2.getInvokee())
a. Print "Invokees Differ: ", processProv1.getInvokee(), processProv2.getInvokee()
3. if (processProv1.getInvoker() != processProv2.getInvoker())
a. Print "Invokers Differ: ", processProv1.getInvoker(), processProv2.getInvoker()
4. if (processProv1.getStatus() != processProv2.getStatus())
a. Print "Process Completion Status Differ: ", processProv1.getStatus(), processProv2.getInvoker()
5. if (processProv1.getRequestReceiveTime() != processProv2.getRequestReceiveTime())
a. Print "Invocation Times Differ: ", processProv1.getRequestReceiveTime(), processProv2.getRequestReceiveTime()
6. foreach ($dataProd1, $dataProd2 in processProv1.getUsingData(), processProv2.getUsingData())
a. PrintDataProductDiff($dataProd1, $dataProd2)
7. foreach ($dataProd1, $dataProd2 in processProv1.getProducingData(), processProv2.getProducingData())
a. PrintDataProductDiff($dataProd1, $dataProd2)
8. End
void PrintDataProductDiff(DataProduct dataProd1, DataProduct dataProd2)
1. if (dataProd1.getDataProductID() != dataProd2.getDataProductID()) // trivial. IDs always differ.
a. Print "Produced Data IDs Differ: ", dataProd1.getDataProductID(), dataProd2.getDataProductID()
2. if (dataProd1.getLocation() != dataProd2.getLocation())
a. Print "Produced Data Locations Differ: ", dataProd1.getLocation(), dataProd2.getLocation()
3. if (dataProd1.getTimestamp() != dataProd2.getTimestamp())
a. Print "Produced Data Timestamp Differ: ", dataProd1.getTimestamp(), dataProd2.getTimestamp()
4. End
The second workflow was not run and hence the query results for this are not available.
8. A user has annotated some anatomy images with a key-value pair center=UChicago. Find the outputs of align_warp where the inputs are annotated with center=UChicago.
As noted earlier, the Karma service does not support detailed annotations at the file level, deferring to an external metadata management system such as
MyLEAD. However, it allows generic annotations to be submitted as part of the provenance activities, which can then be queried. We use this facility to add metadata about the input anatomy images to the provenance activity and query it. This is again similar to queries #4, #5 and #6, in that a SQL query retrieves the invocations and we use the
getProcessProvenance API of Karma to retrieve the output data products.
- SQL Query to locate align_warp invocations (invoker+invokee pairs) whose input data products have the annotation "center=UChicago"
SELECT
invokee.workflow_id, invokee.service_id, invokee.workflow_node_id, invokee.workflow_timestep,
invoker.workflow_id, invoker.service_id, invoker.workflow_node_id, invoker.workflow_timestep
FROM
invocation_state_table invocation, entity_table invokee, entity_table invoker, notification_table notifications
WHERE
invokee.entity_id = invocation.invokee_id AND
invoker.entity_id = invocation.invoker_id AND
notifications.source_id = invocation.invokee_id AND
notifications.notification_type = 'ServiceInvoked' AND
invokee.service_id = 'urn:qname:http://www.extreme.indiana.edu/karma/challenge06:AlignWarpService' AND
notifications.notification_xml LIKE '%<Center>UChicago</Center>%';
- We then call getProcessProvenance on the invocations returned by the above query and print the produced data product elements (a sketch is given below). If all four align_warp services match, the results are shown in query8.txt.
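A hedged sketch of this step is given below. It reuses the hypothetical KarmaClient stub and assumes the produced data products appear in the returned XML under elements whose local name is producedData; the actual element names are defined by karma.xsd.

# Sketch only: extracts produced-data elements from the process provenance
# XML by local element name ('producedData' is an assumption; consult
# karma.xsd for the real names).
import xml.etree.ElementTree as ET

def query8(karma, sql_rows) -> None:
    for row in sql_rows:
        invokee, invoker = row[0:4], row[4:8]
        doc = ET.fromstring(karma.getProcessProvenance(invoker, invokee))
        for elem in doc.iter():
            # elem.tag is '{namespace}localName' for namespaced elements.
            if elem.tag.split('}')[-1] == 'producedData':
                print(ET.tostring(elem, encoding='unicode'))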
9. A user has annotated some atlas graphics with key-value pair where the key is studyModality. Find all the graphical atlas sets that have metadata annotation studyModality with values speech, visual or audio, and return all other annotations to these files.
The Karma service does not support complex queries such as these on data product annotations. One way to perform this query would have been to retrieve the annotations for atlas graphics with key
studyModality having the value
speech, visual, or
audio, using a query similar to query #8, and then to filter the keys at the client end. However, we do not expect to answer such queries through the provenance system, and these will not be part of the provenance service API.
Suggested Workflow Variants
Suggest variants of the workflow that can exhibit capabilities that your system supports.
- Workflows with loops.
- Workflows whose structure changes dynamically (or, as a simpler case, workflows with conditional branches).
- Hierarchical composition of workflows (workflows invoking other workflows).
Suggested Queries
Suggest significant queries that your system can support and are not in the proposed list of queries, and how you have implemented/would implement them. These queries may be with regards to a variant of the workflow suggested above.
- Find all [workflows | processes] with a particular execution status [completed | failed | waiting for input]
- Show the client view and service view of the provenance and check for differences
Categorisation of queries
According to your provenance approach, you may be able to provide a categorisation of queries. Can you elaborate on the categorisation and its rationale.
- Provenance Structure
- Annotation
Live systems
If your system can be accessed live (through portal, web page, web service, or other), provide relevant information here.
Further Comments
Provide here further comments.
Conclusions
Provide here your conclusions on the challenge, and issues that you like to see discussed at a face to face meeting.
--
YogeshSimmhan - 13 Sep 2006