UCDGC < Challenge

Challenge.UCDGC

Provenance Challenge: UCD/Genome Center

Participating Team

Team and Project Details

Short team name: UCDGC
Participant names: Manish Kumar Anand, Shawn Bowers, Bertram Ludaescher, Sven Kholer, Timothy McPhillips^?, Sean Riddle
Project Url: http://daks.ucdavis.edu/kepler-ppod/
Project Overview: Comad-Kepler provides a complete set of provenance management features. COMAD (Collection-Oriented Modeling and Design) provenance framework supports nested data collections and captures explicit data dependencies introduced by provenance-aware processes at a fine-grained (node) level. COMAD framework automatically captured in a xml file (.trace), through a set of provenance annotations. COMAD framework has been further extended to automatically store provenance information in a relational database., where immediate and transitive closure dependencies for each node and invocation is stored, and the storage cost is reduced by applying a set of reduction techniques. We have also implemented a visualization tool, Provenance Browser, to allow users to browse and navigate workflow execution traces. Thus, COMAD-kepler provenance system provides support for creating scientific workflows, automatically capturing and storing provenance information, querying provenance information through a high-level query language and visualizing relevant views of provenance information.
Relevant Publications:

Workflow Representation

The workflow is developed in the Kepler scientific workflow system. It uses the COMAD (Collection-Oriented Modeling and Design) framework, a system which allows nested collections of data to flow through a pipeline of actors, each actor interacting with only those parts of the stream it is interested in. This allows us to model the main aspect of control flow, the iteration over CSVFiles, as a functional map operation; this is very simple in COMAD. One challenge, though, was halting the entire workflow whenever any actor fails. While failures could easily be propagated downstream to actors 'below,' communicating them to upstream actors is challenging. To solve this, the classes simply communicate directly in an out-of-band fashion; if one fails, it informs all the rest, and they then act as passthroughs, just forwarding data and not operating on them further.

Open Provenance Model Output

Exporting COMAD-Kepler provenance traces into OPM traces

We ran the workflow across various data sets such that the workflow halts at each different step (8 halts) and in one case succeeds in executing all steps. The table below states those cases, comad-kepler provenance traces (xml) for the same, corresponding opm trace (xml), opm graph (png) and opm process dependency graph (png).

case	comad-kepler trace (xml)	opm trace (xml)	opm graph (png)	opm process dependency graph (png)
Halts at IsCSVReadyFileExists^?	J062941-halt1-IsCSVReadyFileExists.trace	J062941-halt1-IsCSVReadyFileExists.xml	J062941-halt1-IsCSVReadyFileExists.png	J062941-halt1-IsCSVReadyFileExists_Process.png
Halts at IsMatchCSVFileTables^?	J062941-halt2-IsMatchCSVFileTables.trace	J062941-halt2-IsMatchCSVFileTables.xml	J062941-halt2-IsMatchCSVFileTables.png	J062941-halt2-IsMatchCSVFileTables_Process.png
Halts at IsExistsCSVFile^?	J062941-halt3-IsExistsCSVFile.trace	J062941-halt3-IsExistsCSVFile.xml	J062941-halt3-IsExistsCSVFile.png	J062941-halt3-IsExistsCSVFile_Process.png
Halts at IsMatchCSVFileColumnNames^?	J062941-halt4-IsMatchCSVFileColumnNames.trace	J062941-halt4-IsMatchCSVFileColumnNames.xml	J062941-halt4-IsMatchCSVFileColumnNames.png	J062941-halt4-IsMatchCSVFileColumnNames_Process.png
Halts at LoadCSVFileIntoTable^?	J062941-halt5-LoadCSVFileIntoTable.trace	J062941-halt5-LoadCSVFileIntoTable.xml	J062941-halt5-LoadCSVFileIntoTable.png	J062941-halt5-LoadCSVFileIntoTable_Process.png
Halts at UpdateComputedColumns^?	J062941-halt6-UpdateComputedColumns.trace	J062941-halt6-UpdateComputedColumns.xml	J062941-halt6-UpdateComputedColumns.png	J062941-halt6-UpdateComputedColumns_Process.png
Halts at IsMatchTableRowCount^?	J062941-halt7-IsMatchTableRowCount.trace	J062941-halt7-IsMatchTableRowCount.xml	J062941-halt7-IsMatchTableRowCount.png	J062941-halt7-IsMatchTableRowCount_Process.png
Halts at IsMatchTableColumnRanges^?	J062941-halt8-IsMatchTableColumnRanges.trace	J062941-halt8-IsMatchTableColumnRanges.xml	J062941-halt8-IsMatchTableColumnRanges.png	J062941-halt8-IsMatchTableColumnRanges_Process.png
Succeeds	J062941-success.trace	J062941-success.xml	J062941-success.png	J062941-success_Process.png

Importing OPM traces into COMAD-Kepler traces

We imported the opm output from various teams into our comad-kepler trace. We then stored it into our relational system (MySQL^?) to query the imported provenance information.

Team	opm trace (xml)	comad-kepler trace (xml)
UCDGC	UCDGC_opm.xml	UCDGC_comad.trace
SotonUSCISIPc3	SotonUSCISIPc3_opm.xml	SotonUSCISIPc3_comad.trace
TetherlessPC3	TetherlessPC3_opm.xml	TetherlessPC3_comad.trace
KCL	KCL_opm.xml	KCL_comad.trace
PASS3	PASS3_opm.xml	PASS3_comad.trace

Query Results

The provenance information (trace) where stored in MySQL^?, and queries were written in SQL.

Core Query 1

For a given detection, which CSV files contributed to it?

Basic sample answer: The CSV file containing the Detection table.

Advanced sample answer: The CSV file containing the Detection table, CSV file containing the Image table (as the image is an attribute of the detection), and CSV file containing the FrameMetadata table (as the frame metadata is an attribute of the image).

SELECT value 
FROM runAnnotationView 
WHERE name="file" 
AND targetNodeId IN (SELECT targetNodeId FROM runAnnotationView WHERE value LIKE '%Detection%')
;

Query Result:

+----------------------------------------------------------------------------------------------------------+
| value                                                                                                    |
+----------------------------------------------------------------------------------------------------------+
| 
/Users/manish/Desktop/pc3/datasets/J062941-success/P2_J062941_B001_P2fits0_20081115_P2Detection.csv | 
+----------------------------------------------------------------------------------------------------------+

Core Query 2

The user considers a table to contain values they do not expect. Was the range check (IsMatchTableColumnRanges) performed for this table?

Sample answer: Yes

SELECT "true"
FROM runDataView rd1, runColldata r, runAnnotation ra 
WHERE rd1.name="ccdID" 
AND rd1.value="1"  
AND r.nodeId=rd1.parent
AND ra.targetNodeId=r.parent
AND ra.name="IsMatchTableColumnRanges_success"
;

Query Result:

+------+
| true |
+------+
| true | 
+------+

Core Query 3

Which operation executions were strictly necessary for the Image table to contain a particular (non-computed) value?

Sample answer: call of ReadCSVReadyFile, call of CreateEmptyLoadDB, 2nd call of ReadCSVFileColumnNames, 2nd call of LoadCSVFileIntoTable (2nd calls because Image is loaded in the 2nd iteration of the for loop, excluded checks because they do not change anything, excluded UpdatedComputedColumns because it is non-computed, excluded CompactDatabase because it does not affect the value).

SELECT inv.invocName + inv.invocCount
FROM runitemView rv, dbValue dv, invokedAfterDep invD, invocation inv
WHERE rv.name="ccdID"
AND rv.dbValueId=dv.id
AND dv.value="1"
AND invD.toDbInvocId=rv.pIns
AND inv.dbInvocId=invD.fromDbInvocId
;

Query Result:

+---------------------------+------------+
| invocName                 | invocCount |
+---------------------------+------------+
| SetupTestInputs           |          1 | 
| IsCSVReadyFileExists      |          1 | 
| ReadCSVReadyFile          |          1 | 
| CreateCSVTableFiles       |          1 | 
| IsExistsCSVFile           |          2 | 
| ReadCSVFileColumnNames    |          2 | 
| IsMatchCSVFileColumnNames |          2 | 
| LoadCSVFileIntoTable      |          2 | 
+---------------------------+------------+

Optional Queries ran over workflow execution "_J062941-halt8-IsMatchTableColumnRanges.trace_"

Optional Query #1

The workflow halts due to failing an IsMatchTableColumnRanges check. How many tables successfully loaded before the workflow halted due to a failed check?

Sample answer: 2

SELECT COUNT(*) 
FROM runAnnotation 
WHERE name="IsMatchTableColumnRanges_success"
;

Query Result:

+----------+
| COUNT(*) |
+----------+
|        2 | 
+----------+

Optional Query #3

A CSV or header file is deleted during the workflow's execution. How much time expired between a successful IsMatchCSVFileTables test (when the file existed) and an unsuccessful IsExistsCSVFile^? test (when the file had been deleted)?

Sample answer: 3ms

For testing the above query, we it may be simplest to edit the workflow to include deletion of the CSV file as a step.

SELECT ra2.value - ra1.value 
FROM runAnnotationView ra1, runAnnotationView ra2
WHERE ra1.name="IsMatchCSVFileTables_success"
AND ra2.name="halt"
;

Query Result: Difference in time is in ms.

+-----------------------+
| ra2.value - ra1.value |
+-----------------------+
|                  4795 | 
+-----------------------+

Optional Query #4

Why is this entry in the database?

INVOCATION DEPENDENCY CLOSURE

SELECT inv.invocName , inv.invocCount
FROM runitemView rv, dbValue dv, invokedAfterDep invD, invocation inv
WHERE rv.name="ccdID"
AND rv.dbValueId=dv.id
AND dv.value="1"
AND invD.toDbInvocId=rv.pIns
AND inv.dbInvocId=invD.fromDbInvocId
;

Query Result:

+---------------------------+------------+
| invocName                 | invocCount |
+---------------------------+------------+
| SetupTestInputs           |          1 | 
| IsCSVReadyFileExists      |          1 | 
| ReadCSVReadyFile          |          1 | 
| CreateCSVTableFiles       |          1 | 
| IsExistsCSVFile           |          2 | 
| ReadCSVFileColumnNames    |          2 | 
| IsMatchCSVFileColumnNames |          2 | 
| LoadCSVFileIntoTable      |          2 | 
+---------------------------+------------+

DATA DEPENDENCY CLOSURE

SELECT dcv.depNodeId
FROM runitemView rv, dbValue dv, runitemDepcView dcv
WHERE rv.dbValueId=dv.id
AND rv.name="ccdID"
AND dv.value="1"
AND dcv.nodeId=rv.nodeId
;

Query Result:


Empty set 
(for this specific query)

Optional Query #5

A user executes the workflow many times (say 5 times) over different sets of data (j062941, j062942, ... j062945). He wants to determine, which of the execution halted?

Sample answer: workflow_execution_id along with input_data_set_name (say j062941, j062942)

SELECT wfDefName
FROM run r, runAnnotation ra
WHERE ra.name="halt"
AND ra.runId=r.id
;

Query Result:

+-----------------------------------------+
| wfDefName                               |
+-----------------------------------------+
| J062941-halt4-IsMatchCSVFileColumnNames | 
| J062941-halt3-IsExistsCSVFile           | 
| J062941-halt1-IsCSVReadyFileExists      | 
| J062941-halt2-IsMatchCSVFileTables      | 
| J062941-halt8-IsMatchTableColumnRanges  | 
| J062941-halt7-IsMatchTableRowCount      | 
| J062941-halt5-LoadCSVFileIntoTable      | 
+-----------------------------------------+

Optional Query #6

Determine the step where halt occured?

Sample answer: 2nd call of ReadCSVFileColumnNames or 3rd call of LoadCSVFileIntoTable, etc.

SELECT inv.invocName , inv.invocCount
FROM runitemView rv, runAnnotation ra, invocation inv
WHERE ra.name="halt"
AND ra.nodeId=rv.nodeId
AND rv.pIns=inv.dbInvocId
;

Query Result:

+--------------------------+------------+
| invocName                | invocCount |
+--------------------------+------------+
| IsMatchTableColumnRanges |          3 | 
+--------------------------+------------+

Optional Query #7

Determine data and associated granularities of the data being processed, when halt occured?

Sample answer: data_id and associated granularities(column, row, table, file)

SELECT ra.name, ra.value
FROM runCollData r1, runCollData r2, runAnnotationView ra
WHERE r1.numLeft < r2.numLeft
AND r1.numRight > r2.numRight
AND r2.nodeId = (SELECT targetNodeId from runAnnotation WHERE name="halt")
AND r1.nodeId = ra.targetNodeId
;

Query Result:

+-------------------------------+----------------------------------------------------------------------------------------------------------------------------+
| name                          | value                                                                                                                      |
+-------------------------------+----------------------------------------------------------------------------------------------------------------------------+
| db                            | J062941-halt8-IsMatchTableColumnRanges--927058851_LoadDB                                                                   | 
| file                          | /Users/manish/Desktop/pc3/datasets/J062941-halt8-IsMatchTableColumnRanges/P2_J062941_B001_P2fits0_20081115_P2Detection.csv | 
| sql_table                     | P2Detection                                                                                                                | 
| LoadCSVFileIntoTable_success  | 1240629827414                                                                                                              | 
| UpdateComputedColumns_success | 1240629828626                                                                                                              | 
| IsMatchTableRowCount_success  | 1240629828635                                                                                                              | 
+-------------------------------+----------------------------------------------------------------------------------------------------------------------------+

Optional Query #8

Which steps were completed successfully before the halt occurred?

Sample answer: steps name along with call (or count) number, if steps were executed multiple times

SELECT inv.invocName , inv.invocCount
FROM runitemView rv, runAnnotation ra, invocation inv, invokedAfterDep invD
WHERE ra.name="halt"
AND ra.nodeId=rv.nodeId
AND invD.toDbInvocId=rv.pIns
AND inv.dbInvocId=invD.fromDbInvocId
;

Query Result:

+---------------------------+------------+
| invocName                 | invocCount |
+---------------------------+------------+
| SetupTestInputs           |          1 | 
| IsCSVReadyFileExists      |          1 | 
| ReadCSVReadyFile          |          1 | 
| CreateCSVTableFiles       |          1 | 
| IsExistsCSVFile           |          3 | 
| ReadCSVFileColumnNames    |          3 | 
| IsMatchCSVFileColumnNames |          3 | 
| LoadCSVFileIntoTable      |          3 | 
| IsMatchTableColumnRanges  |          3 | 
+---------------------------+------------+

Optional Query #10

For a workflow execution, determine the user inputs?

Sample answer: jobID, CVSRootPath^?

SELECT r.nodeId, r.itemType, r.name, r.type, dv.value 
FROM runitemView r, invocation inv, dbValue dv 
WHERE r.pins = inv.dbInvocId
AND inv.invocName="SetupTestInputs"
AND dv.id=r.dbValueId
;

Query Result:

+--------+----------+----------+-------------+---------------------------------------------------------------------------+
| nodeId | itemType | name     | type        | value                                                                     |
+--------+----------+----------+-------------+---------------------------------------------------------------------------+
|   7327 | data     | db_name  | StringToken | J062941-halt8-IsMatchTableColumnRanges--927058851                         | 
|   7328 | data     | data_dir | StringToken | /Users/manish/Desktop/pc3/datasets/J062941-halt8-IsMatchTableColumnRanges | 
+--------+----------+----------+-------------+---------------------------------------------------------------------------+

Optional Query #11

For a workflow execution, determine steps that required user inputs?

Sample answer: IsCVSReadyFileExists^?, CreateEmptyLoadDB^?

SELECT inv2.invocName, inv2.invocCount, dv.value
FROM runitemview r, invocation inv, dependencyView d, runitemView rv, dbValue dv, invocation inv2 
WHERE r.pins = inv.dbInvocId
AND inv.invocName="SetupTestInputs"
AND d.nodeId=r.nodeId
AND rv.pDep=d.pDep
AND dv.id=r.dbvalueId
AND rv.pIns=inv2.dbInvocId
;

Query Result:

+----------------------+------------+---------------------------------------------------------------------------+
| invocName            | invocCount | value                                                                     |
+----------------------+------------+---------------------------------------------------------------------------+
| CreateEmptyLoadDB    |          2 | J062941-halt8-IsMatchTableColumnRanges--927058851                         | 
| IsCSVReadyFileExists |          1 | /Users/manish/Desktop/pc3/datasets/J062941-halt8-IsMatchTableColumnRanges | 
| ReadCSVReadyFile     |          1 | /Users/manish/Desktop/pc3/datasets/J062941-halt8-IsMatchTableColumnRanges | 
+----------------------+------------+---------------------------------------------------------------------------+

Optional Query #13

_For a workflow execution, display the following provenance views: data dependency views, step dependency view?

Sample answer: either display these views in form of a directed graph, or as a pair of dependency relations between data or steps.

DATA DEPENDENCY VIEW (RELATIONS)

SELECT nodeId, depNodeId 
FROM runitemDepView 
;

Query Result:

Dependency Relations between nodes

PROCESS DEPENDENCY VIEW (RELATIONS)

SELECT fromdbInvocId, todbInvocId
FROM immedInvokedAfter
;

Query Result:

Dependency Relations between processes

Suggested Workflow Variants

Suggested Queries

Suggestions for Modification of the Open Provenance Model

Conclusions

-- ManishAnand - 25 Apr 2009
to top

End of topic
Skip to action links | Back to top

Attachment	Action	Size	Date	Who
pc3-comad-revised.pdf	manage	104.1 K	21 Apr 2009 - 23:40	SeanRiddle
pc3-comad-revised.png	manage	131.4 K	21 Apr 2009 - 23:40	SeanRiddle
pc3-comad-revised.xml	manage	19.2 K	21 Apr 2009 - 23:41	SeanRiddle
J062941-success.trace	manage	198.3 K	25 Apr 2009 - 04:51	ManishAnand
J062941-success.xml	manage	425.2 K	25 Apr 2009 - 14:49	ManishAnand
J062941-success_Process.png	manage	165.6 K	07 May 2009 - 21:43	ManishAnand
J062941-success.png	manage	443.1 K	25 Apr 2009 - 04:58	ManishAnand
J062941-halt1-IsCSVReadyFileExists.trace	manage	6.0 K	25 Apr 2009 - 05:09	ManishAnand
J062941-halt1-IsCSVReadyFileExists.xml	manage	2.9 K	25 Apr 2009 - 14:54	ManishAnand
J062941-halt1-IsCSVReadyFileExists_Process.png	manage	12.2 K	25 Apr 2009 - 05:23	ManishAnand
J062941-halt1-IsCSVReadyFileExists.png	manage	32.3 K	25 Apr 2009 - 05:35	ManishAnand
J062941-halt2-IsMatchCSVFileTables.trace	manage	9.8 K	25 Apr 2009 - 05:14	ManishAnand
J062941-halt3-IsExistsCSVFile.trace	manage	13.1 K	25 Apr 2009 - 05:15	ManishAnand
J062941-halt4-IsMatchCSVFileColumnNames.trace	manage	20.3 K	25 Apr 2009 - 05:15	ManishAnand
J062941-halt5-LoadCSVFileIntoTable.trace	manage	56.7 K	25 Apr 2009 - 05:15	ManishAnand
J062941-halt6-UpdateComputedColumns.trace	manage	198.6 K	25 Apr 2009 - 05:15	ManishAnand
J062941-halt7-IsMatchTableRowCount.trace	manage	196.2 K	25 Apr 2009 - 05:16	ManishAnand
J062941-halt8-IsMatchTableColumnRanges.trace	manage	198.0 K	25 Apr 2009 - 05:16	ManishAnand
J062941-small.trace	manage	14.8 K	25 Apr 2009 - 05:16	ManishAnand
J062941-halt2-IsMatchCSVFileTables.xml	manage	12.1 K	25 Apr 2009 - 14:54	ManishAnand
J062941-halt3-IsExistsCSVFile.xml	manage	19.9 K	25 Apr 2009 - 14:54	ManishAnand
J062941-halt4-IsMatchCSVFileColumnNames.xml	manage	33.2 K	25 Apr 2009 - 14:54	ManishAnand
J062941-halt5-LoadCSVFileIntoTable.xml	manage	114.8 K	25 Apr 2009 - 14:53	ManishAnand
J062941-halt6-UpdateComputedColumns.xml	manage	425.4 K	25 Apr 2009 - 14:52	ManishAnand
J062941-halt7-IsMatchTableRowCount.xml	manage	419.8 K	25 Apr 2009 - 14:52	ManishAnand
J062941-halt8-IsMatchTableColumnRanges.xml	manage	424.0 K	25 Apr 2009 - 14:51	ManishAnand
J062941-halt2-IsMatchCSVFileTables_Process.png	manage	26.4 K	25 Apr 2009 - 05:24	ManishAnand
J062941-halt3-IsExistsCSVFile_Process.png	manage	68.7 K	25 Apr 2009 - 05:24	ManishAnand
J062941-halt4-IsMatchCSVFileColumnNames_Process.png	manage	90.5 K	25 Apr 2009 - 05:26	ManishAnand
J062941-halt5-LoadCSVFileIntoTable.png	manage	3191.6 K	25 Apr 2009 - 05:47	ManishAnand
J062941-halt5-LoadCSVFileIntoTable_Process.png	manage	87.6 K	25 Apr 2009 - 05:29	ManishAnand
J062941-halt6-UpdateComputedColumns_Process.png	manage	111.0 K	25 Apr 2009 - 05:30	ManishAnand
J062941-halt7-IsMatchTableRowCount_Process.png	manage	93.5 K	25 Apr 2009 - 05:31	ManishAnand
J062941-halt8-IsMatchTableColumnRanges_Process.png	manage	105.8 K	25 Apr 2009 - 05:32	ManishAnand
J062941-small.png	manage	428.0 K	25 Apr 2009 - 05:33	ManishAnand
J062941-small_Process.png	manage	88.8 K	25 Apr 2009 - 05:34	ManishAnand
J062941-halt2-IsMatchCSVFileTables.png	manage	192.1 K	25 Apr 2009 - 05:37	ManishAnand
J062941-halt3-IsExistsCSVFile.png	manage	364.8 K	25 Apr 2009 - 05:37	ManishAnand
J062941-halt4-IsMatchCSVFileColumnNames.png	manage	654.8 K	25 Apr 2009 - 05:39	ManishAnand
J062941-halt6-UpdateComputedColumns.png	manage	390.2 K	25 Apr 2009 - 05:46	ManishAnand
J062941-halt7-IsMatchTableRowCount.png	manage	535.9 K	25 Apr 2009 - 05:50	ManishAnand
J062941-halt8-IsMatchTableColumnRanges.png	manage	390.6 K	25 Apr 2009 - 05:54	ManishAnand
J062941-small.xml	manage	23.5 K	25 Apr 2009 - 14:51	ManishAnand
UCDGC_comad.trace	manage	195.0 K	27 Apr 2009 - 23:00	ManishAnand
SotonUSCISIPc3_comad.trace	manage	18.3 K	27 Apr 2009 - 23:01	ManishAnand
TetherlessPC3_comad.trace	manage	13.3 K	27 Apr 2009 - 23:01	ManishAnand
PASS3_comad.trace	manage	1610.8 K	27 Apr 2009 - 23:12	ManishAnand
SotonUSCISIPc3_opm.xml	manage	117.9 K	27 Apr 2009 - 23:21	ManishAnand
TetherlessPC3_opm.xml	manage	38.6 K	27 Apr 2009 - 23:21	ManishAnand
PASS3_opm.xml	manage	5618.4 K	27 Apr 2009 - 23:21	ManishAnand
KCL_opm.xml	manage	350.8 K	27 Apr 2009 - 23:22	ManishAnand
KCL_comad.trace	manage	160.4 K	27 Apr 2009 - 23:32	ManishAnand
UCDGC_opm.xml	manage	425.2 K	27 Apr 2009 - 23:41	ManishAnand

You are here: Challenge > ThirdProvenanceChallenge > ParticipatingTeams3 > UCDGC

to top