Provenance Challenge: UCD/Genome Center
Participating Team
Team and Project Details
- Short team name: UCDGC
- Participant names: Manish Kumar Anand, Shawn Bowers, Bertram Ludaescher, Sven Kholer, Timothy McPhillips?, Sean Riddle
- Project Url: http://daks.ucdavis.edu/kepler-ppod/
- Project Overview: Comad-Kepler provides a complete set of provenance management features. COMAD (Collection-Oriented Modeling and Design) provenance framework supports nested data collections and captures explicit data dependencies introduced by provenance-aware processes at a fine-grained (node) level. COMAD framework automatically captured in a xml file (.trace), through a set of provenance annotations. COMAD framework has been further extended to automatically store provenance information in a relational database., where immediate and transitive closure dependencies for each node and invocation is stored, and the storage cost is reduced by applying a set of reduction techniques. We have also implemented a visualization tool, Provenance Browser, to allow users to browse and navigate workflow execution traces. Thus, COMAD-kepler provenance system provides support for creating scientific workflows, automatically capturing and storing provenance information, querying provenance information through a high-level query language and visualizing relevant views of provenance information.
- Relevant Publications:
Workflow Representation
The workflow is developed in the Kepler scientific workflow system. It uses the COMAD (Collection-Oriented Modeling and Design) framework, a system which allows nested collections of data to flow through a pipeline of actors, each actor interacting with only those parts of the stream it is interested in. This allows us to model the main aspect of control flow, the iteration over CSVFiles, as a functional map operation; this is very simple in COMAD. One challenge, though, was halting the entire workflow whenever any actor fails. While failures could easily be propagated downstream to actors 'below,' communicating them to upstream actors is challenging. To solve this, the classes simply communicate directly in an out-of-band fashion; if one fails, it informs all the rest, and they then act as passthroughs, just forwarding data and not operating on them further.
Open Provenance Model Output
Exporting COMAD-Kepler provenance traces into OPM traces
We ran the workflow across various data sets such that the workflow halts at each different step (8 halts) and in one case succeeds in executing all steps. The table below states those cases, comad-kepler provenance traces (xml) for the same, corresponding opm trace (xml), opm graph (png) and opm process dependency graph (png).
Importing OPM traces into COMAD-Kepler traces
We imported the opm output from various teams into our comad-kepler trace. We then stored it into our relational system (
MySQL?) to query the imported provenance information.
Query Results
The provenance information (trace) where stored in
MySQL?, and queries were written in SQL.
Core Query 1
For a given detection, which CSV files contributed to it?
Basic sample answer: The CSV file containing the Detection table.
Advanced sample answer: The CSV file containing the Detection table, CSV file containing the Image table (as the image is an attribute of the detection), and CSV file containing the FrameMetadata table (as the frame metadata is an attribute of the image).
SELECT value
FROM runAnnotationView
WHERE name="file"
AND targetNodeId IN (SELECT targetNodeId FROM runAnnotationView WHERE value LIKE '%Detection%')
;
Query Result:
+----------------------------------------------------------------------------------------------------------+
| value |
+----------------------------------------------------------------------------------------------------------+
|
/Users/manish/Desktop/pc3/datasets/J062941-success/P2_J062941_B001_P2fits0_20081115_P2Detection.csv |
+----------------------------------------------------------------------------------------------------------+
Core Query 2
The user considers a table to contain values they do not expect. Was the range check (IsMatchTableColumnRanges) performed for this table?
Sample answer: Yes
SELECT "true"
FROM runDataView rd1, runColldata r, runAnnotation ra
WHERE rd1.name="ccdID"
AND rd1.value="1"
AND r.nodeId=rd1.parent
AND ra.targetNodeId=r.parent
AND ra.name="IsMatchTableColumnRanges_success"
;
Query Result:
+------+
| true |
+------+
| true |
+------+
Core Query 3
Which operation executions were strictly necessary for the Image table to contain a particular (non-computed) value?
Sample answer: call of ReadCSVReadyFile, call of CreateEmptyLoadDB, 2nd call of ReadCSVFileColumnNames, 2nd call of LoadCSVFileIntoTable (2nd calls because Image is loaded in the 2nd iteration of the for loop, excluded checks because they do not change anything, excluded UpdatedComputedColumns because it is non-computed, excluded CompactDatabase because it does not affect the value).
SELECT inv.invocName + inv.invocCount
FROM runitemView rv, dbValue dv, invokedAfterDep invD, invocation inv
WHERE rv.name="ccdID"
AND rv.dbValueId=dv.id
AND dv.value="1"
AND invD.toDbInvocId=rv.pIns
AND inv.dbInvocId=invD.fromDbInvocId
;
Query Result:
+---------------------------+------------+
| invocName | invocCount |
+---------------------------+------------+
| SetupTestInputs | 1 |
| IsCSVReadyFileExists | 1 |
| ReadCSVReadyFile | 1 |
| CreateCSVTableFiles | 1 |
| IsExistsCSVFile | 2 |
| ReadCSVFileColumnNames | 2 |
| IsMatchCSVFileColumnNames | 2 |
| LoadCSVFileIntoTable | 2 |
+---------------------------+------------+
Optional Queries ran over workflow execution "_J062941-halt8-IsMatchTableColumnRanges.trace_"
Optional Query #1
The workflow halts due to failing an IsMatchTableColumnRanges check. How many tables successfully loaded before the workflow halted due to a failed check?
Sample answer: 2
SELECT COUNT(*)
FROM runAnnotation
WHERE name="IsMatchTableColumnRanges_success"
;
Query Result:
+----------+
| COUNT(*) |
+----------+
| 2 |
+----------+
Optional Query #3
A CSV or header file is deleted during the workflow's execution. How much time expired between a successful IsMatchCSVFileTables test (when the file existed) and an unsuccessful IsExistsCSVFile? test (when the file had been deleted)?
Sample answer: 3ms
For testing the above query, we it may be simplest to edit the workflow to include deletion of the CSV file as a step.
SELECT ra2.value - ra1.value
FROM runAnnotationView ra1, runAnnotationView ra2
WHERE ra1.name="IsMatchCSVFileTables_success"
AND ra2.name="halt"
;
Query Result: Difference in time is in ms.
+-----------------------+
| ra2.value - ra1.value |
+-----------------------+
| 4795 |
+-----------------------+
Optional Query #4
Why is this entry in the database?
INVOCATION DEPENDENCY CLOSURE
SELECT inv.invocName , inv.invocCount
FROM runitemView rv, dbValue dv, invokedAfterDep invD, invocation inv
WHERE rv.name="ccdID"
AND rv.dbValueId=dv.id
AND dv.value="1"
AND invD.toDbInvocId=rv.pIns
AND inv.dbInvocId=invD.fromDbInvocId
;
Query Result:
+---------------------------+------------+
| invocName | invocCount |
+---------------------------+------------+
| SetupTestInputs | 1 |
| IsCSVReadyFileExists | 1 |
| ReadCSVReadyFile | 1 |
| CreateCSVTableFiles | 1 |
| IsExistsCSVFile | 2 |
| ReadCSVFileColumnNames | 2 |
| IsMatchCSVFileColumnNames | 2 |
| LoadCSVFileIntoTable | 2 |
+---------------------------+------------+
DATA DEPENDENCY CLOSURE
SELECT dcv.depNodeId
FROM runitemView rv, dbValue dv, runitemDepcView dcv
WHERE rv.dbValueId=dv.id
AND rv.name="ccdID"
AND dv.value="1"
AND dcv.nodeId=rv.nodeId
;
Query Result:
Empty set
(for this specific query)
Optional Query #5
A user executes the workflow many times (say 5 times) over different sets of data (j062941, j062942, ... j062945). He wants to determine, which of the execution halted?
Sample answer: workflow_execution_id along with input_data_set_name (say j062941, j062942)
SELECT wfDefName
FROM run r, runAnnotation ra
WHERE ra.name="halt"
AND ra.runId=r.id
;
Query Result:
+-----------------------------------------+
| wfDefName |
+-----------------------------------------+
| J062941-halt4-IsMatchCSVFileColumnNames |
| J062941-halt3-IsExistsCSVFile |
| J062941-halt1-IsCSVReadyFileExists |
| J062941-halt2-IsMatchCSVFileTables |
| J062941-halt8-IsMatchTableColumnRanges |
| J062941-halt7-IsMatchTableRowCount |
| J062941-halt5-LoadCSVFileIntoTable |
+-----------------------------------------+
Optional Query #6
Determine the step where halt occured?
Sample answer: 2nd call of ReadCSVFileColumnNames or 3rd call of LoadCSVFileIntoTable, etc.
SELECT inv.invocName , inv.invocCount
FROM runitemView rv, runAnnotation ra, invocation inv
WHERE ra.name="halt"
AND ra.nodeId=rv.nodeId
AND rv.pIns=inv.dbInvocId
;
Query Result:
+--------------------------+------------+
| invocName | invocCount |
+--------------------------+------------+
| IsMatchTableColumnRanges | 3 |
+--------------------------+------------+
Optional Query #7
Determine data and associated granularities of the data being processed, when halt occured?
Sample answer: data_id and associated granularities(column, row, table, file)
SELECT ra.name, ra.value
FROM runCollData r1, runCollData r2, runAnnotationView ra
WHERE r1.numLeft < r2.numLeft
AND r1.numRight > r2.numRight
AND r2.nodeId = (SELECT targetNodeId from runAnnotation WHERE name="halt")
AND r1.nodeId = ra.targetNodeId
;
Query Result:
+-------------------------------+----------------------------------------------------------------------------------------------------------------------------+
| name | value |
+-------------------------------+----------------------------------------------------------------------------------------------------------------------------+
| db | J062941-halt8-IsMatchTableColumnRanges--927058851_LoadDB |
| file | /Users/manish/Desktop/pc3/datasets/J062941-halt8-IsMatchTableColumnRanges/P2_J062941_B001_P2fits0_20081115_P2Detection.csv |
| sql_table | P2Detection |
| LoadCSVFileIntoTable_success | 1240629827414 |
| UpdateComputedColumns_success | 1240629828626 |
| IsMatchTableRowCount_success | 1240629828635 |
+-------------------------------+----------------------------------------------------------------------------------------------------------------------------+
Optional Query #8
Which steps were completed successfully before the halt occurred?
Sample answer: steps name along with call (or count) number, if steps were executed multiple times
SELECT inv.invocName , inv.invocCount
FROM runitemView rv, runAnnotation ra, invocation inv, invokedAfterDep invD
WHERE ra.name="halt"
AND ra.nodeId=rv.nodeId
AND invD.toDbInvocId=rv.pIns
AND inv.dbInvocId=invD.fromDbInvocId
;
Query Result:
+---------------------------+------------+
| invocName | invocCount |
+---------------------------+------------+
| SetupTestInputs | 1 |
| IsCSVReadyFileExists | 1 |
| ReadCSVReadyFile | 1 |
| CreateCSVTableFiles | 1 |
| IsExistsCSVFile | 3 |
| ReadCSVFileColumnNames | 3 |
| IsMatchCSVFileColumnNames | 3 |
| LoadCSVFileIntoTable | 3 |
| IsMatchTableColumnRanges | 3 |
+---------------------------+------------+
Optional Query #10
For a workflow execution, determine the user inputs?
Sample answer: jobID,
CVSRootPath?
SELECT r.nodeId, r.itemType, r.name, r.type, dv.value
FROM runitemView r, invocation inv, dbValue dv
WHERE r.pins = inv.dbInvocId
AND inv.invocName="SetupTestInputs"
AND dv.id=r.dbValueId
;
Query Result:
+--------+----------+----------+-------------+---------------------------------------------------------------------------+
| nodeId | itemType | name | type | value |
+--------+----------+----------+-------------+---------------------------------------------------------------------------+
| 7327 | data | db_name | StringToken | J062941-halt8-IsMatchTableColumnRanges--927058851 |
| 7328 | data | data_dir | StringToken | /Users/manish/Desktop/pc3/datasets/J062941-halt8-IsMatchTableColumnRanges |
+--------+----------+----------+-------------+---------------------------------------------------------------------------+
Optional Query #11
For a workflow execution, determine steps that required user inputs?
Sample answer: IsCVSReadyFileExists?,
CreateEmptyLoadDB?
SELECT inv2.invocName, inv2.invocCount, dv.value
FROM runitemview r, invocation inv, dependencyView d, runitemView rv, dbValue dv, invocation inv2
WHERE r.pins = inv.dbInvocId
AND inv.invocName="SetupTestInputs"
AND d.nodeId=r.nodeId
AND rv.pDep=d.pDep
AND dv.id=r.dbvalueId
AND rv.pIns=inv2.dbInvocId
;
Query Result:
+----------------------+------------+---------------------------------------------------------------------------+
| invocName | invocCount | value |
+----------------------+------------+---------------------------------------------------------------------------+
| CreateEmptyLoadDB | 2 | J062941-halt8-IsMatchTableColumnRanges--927058851 |
| IsCSVReadyFileExists | 1 | /Users/manish/Desktop/pc3/datasets/J062941-halt8-IsMatchTableColumnRanges |
| ReadCSVReadyFile | 1 | /Users/manish/Desktop/pc3/datasets/J062941-halt8-IsMatchTableColumnRanges |
+----------------------+------------+---------------------------------------------------------------------------+
Optional Query #13
_For a workflow execution, display the following provenance views: data dependency views, step dependency view?
Sample answer: either display these views in form of a directed graph, or as a pair of dependency relations between data or steps.
DATA DEPENDENCY VIEW (RELATIONS)
SELECT nodeId, depNodeId
FROM runitemDepView
;
Query Result:
Dependency Relations between nodes
PROCESS DEPENDENCY VIEW (RELATIONS)
SELECT fromdbInvocId, todbInvocId
FROM immedInvokedAfter
;
Query Result:
Dependency Relations between processes
Suggested Workflow Variants
Suggested Queries
Suggestions for Modification of the Open Provenance Model
Conclusions
--
ManishAnand - 25 Apr 2009
to top