Provenance Challenge: SDSC
Participating Team
Team and Project Details
Workflow Representation
Open Provenance Model Output
Our OPM output is here for a successful execution, and here for a failed execution (IsExistsCSVFile fails). The output is XML using the OPM v1.01.a schema by Paul Groth and Luc Moreau.
Here is the opm2dot graph for the successful execution.
Query Results
We implemented our queries in XQuery 1.0. For each query, we load the provenance XML document into $graph. Additionally, we created a library (called opmLib) that contains utility queries, such as getAllAncestorProcesses() and getArtifactIdsThatContainsValue().
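The opmLib source is not reproduced on this page. As a rough illustration only, the two helpers named above might look something like the following sketch. The `local:` functions, the sample graph, and its ids are all hypothetical. The element layout (`artifacts/artifact`, `processes/process`, `causalDependencies/used` and `wasGeneratedBy` with `effect` and `cause` ids) is inferred from the queries on this page, and no element namespace is assumed.

```xquery
xquery version "1.0";

(: Hypothetical re-creation of two opmLib helpers; the real library may differ. :)

(: Ids of all artifacts whose value contains the given text. :)
declare function local:getArtifactIdsThatContainsValue(
    $graph as element(), $text as xs:string) as xs:string*
{
  for $artifact in $graph/artifacts/artifact
  where contains(string($artifact/value), $text)
  return string($artifact/@id)
};

(: All processes upstream of the given artifacts: follow wasGeneratedBy to the
   generating process, then its used edges to that process's inputs, and repeat.
   Assumes the graph is acyclic; a cycle would make the recursion run forever. :)
declare function local:getAllAncestorProcesses(
    $graph as element(), $artifactIds as xs:string*) as element(process)*
{
  let $deps := $graph/causalDependencies
  let $procIds := $deps/wasGeneratedBy[effect/@id = $artifactIds]/cause/@id
  let $inputIds := distinct-values($deps/used[effect/@id = $procIds]/cause/@id)
  return
    if (empty($procIds)) then ()
    else
      (: the union removes duplicates and returns document order :)
      ($graph/processes/process[@id = $procIds]
       | local:getAllAncestorProcesses($graph, $inputIds))
};

(: Made-up two-step graph: a0 -> p0 -> a1 -> p1 -> a2 :)
declare variable $graph :=
  <opmGraph>
    <processes>
      <process id="p0"><value>Read</value></process>
      <process id="p1"><value>Load</value></process>
    </processes>
    <artifacts>
      <artifact id="a0"><value>input.csv</value></artifact>
      <artifact id="a1"><value>rows</value></artifact>
      <artifact id="a2"><value>detection 42</value></artifact>
    </artifacts>
    <causalDependencies>
      <used><effect id="p0"/><role value="in"/><cause id="a0"/></used>
      <wasGeneratedBy><effect id="a1"/><role value="out"/><cause id="p0"/></wasGeneratedBy>
      <used><effect id="p1"/><role value="in"/><cause id="a1"/></used>
      <wasGeneratedBy><effect id="a2"/><role value="out"/><cause id="p1"/></wasGeneratedBy>
    </causalDependencies>
  </opmGraph>;

local:getAllAncestorProcesses(
  $graph, local:getArtifactIdsThatContainsValue($graph, "detection"))
```

On this sample graph the query body should return both process elements, p0 and p1, since each lies on the path that produced artifact a2.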
Query 1
For a given detection, which CSV files contributed to it?
LoadCSVFileIntoTable tells the database to import the detections directly from a file. Since we did not instrument the database, we added an output to LoadCSVFileIntoTable, called Detections, which outputs the detection values. We can then query for a specific detection value, e.g., 261887437030025141.
(: get the artifact id containing the detection :)
let $artifactId := opmLib:getArtifactIdsThatContainsValue($graph, "261887437030025141")
(: get the used relations of the process that generated it :)
let $inputs := opmLib:getImmediateAncestorUseds($graph, $artifactId)
(: get the artifact with role FileEntry :)
for $used in $inputs, $artifact in $graph/artifacts/artifact
where $used/role/@value = "FileEntry" and
$used/cause/@id = $artifact/@id
return $artifact/value
The output is a FileEntry used by LoadCSVFileIntoTable. This is a composite artifact, and support for accessing sub-artifacts would allow the query to extract just the file name.
Output:
<value>
{Checksum = "f8f9d70711cb3a1cb8b359d99d98fa63",
ColumnNames = {"objID", "detectID", "ippObjID", "ippDetectID", "filterID", "imageID", "obsTime", "xPos", "yPos", "xPosErr", "yPosErr", "instFlux",
"instFluxErr", "psfWidMajor", "psfWidMinor", "psfTheta", "psfLikelihood", "psfCf", "infoFlag", "htmID", "zoneID", "assocDate", "modNum", "ra",
"dec", "raErr", "decErr", "cx", "cy", "cz", "peakFlux", "calMag", "calMagErr", "calFlux", "calFluxErr", "calColor", "calColorErr", "sky",
"skyErr", "sgSep", "dataRelease"},
FilePath = "pc3/workflows/data/J062941/P2_J062941_B001_P2fits0_20081115_P2Detection.csv",
HeaderPath = "pc3/workflows/data/J062941/P2_J062941_B001_P2fits0_20081115_P2Detection.csv.hdr",
RowCount = 20,
TargetTable = "P2Detection"}
</value>
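Without sub-artifact access, one possible workaround is to pull the file name out of the composite value with a regular expression. The sketch below is not part of our submitted queries. `local:field` is a hypothetical helper, and the input string is an abbreviated copy of the value shown above. It assumes the field name occurs only once in the value and contains no regex metacharacters.

```xquery
xquery version "1.0";

(: Hypothetical helper: return the quoted string assigned to one named field
   of a composite value, or the empty sequence if the field is absent. :)
declare function local:field($value as xs:string, $name as xs:string) as xs:string?
{
  (: the 's' flag lets '.' match newlines, since the value spans several lines :)
  let $pattern := concat('^.*', $name, ' = "([^"]*)".*$')
  return
    if (matches($value, $pattern, 's'))
    then replace($value, $pattern, '$1', 's')
    else ()
};

(: Abbreviated copy of the FileEntry value (ColumnNames omitted) :)
let $value := '{Checksum = "f8f9d70711cb3a1cb8b359d99d98fa63",
FilePath = "pc3/workflows/data/J062941/P2_J062941_B001_P2fits0_20081115_P2Detection.csv",
HeaderPath = "pc3/workflows/data/J062941/P2_J062941_B001_P2fits0_20081115_P2Detection.csv.hdr",
RowCount = 20,
TargetTable = "P2Detection"}'
return local:field($value, "FilePath")
```

This should return only the FilePath string. String matching of this kind is brittle compared with structured sub-artifacts, which is why model-level support would be preferable.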
Query 2
The user considers a table to contain values they do not expect. Was the range check (IsMatchTableColumnRanges) performed for this table?
(: find artifact containing table name :)
let $artifactIds := opmLib:getArtifactIdsThatContainsValue($graph, 'TargetTable = "P2Detection"')
(: find the one used by LoadCSVFileIntoTable :)
let $artifactId := opmLib:getArtifactIdsUsedByProcessValue($graph, $artifactIds, "LoadCSVFileIntoTable")
(: see if any descendant processes were IsMatchTableColumnRanges :)
let $found := (for $process in opmLib:getAllDescendantProcesses($graph, $artifactId)
where contains($process/value, "IsMatchTableColumnRanges")
return $process)
return if(count($found) = 0) then "no" else "yes"
Output:
yes
Query 3
Which operation executions were strictly necessary for the Image table to contain a particular (non-computed) value?
(: find artifacts containing the image table name :)
let $artifactIds := opmLib:getArtifactIdsThatContainsValue($graph, 'TargetTable = "P2ImageMeta"')
(: get the artifact id that was used by LoadCSVFileIntoTable :)
let $id := (for $id in $artifactIds,
$used in $graph/causalDependencies/used,
$process in $graph/processes/process
where $id = $used/cause/@id and
$used/effect/@id = $process/@id and
contains($process/value, "LoadCSVFileIntoTable")
return $id)
(: return all processes that led to that artifact :)
return opmLib:getAllAncestorProcesses($graph, $id)
Output:
<process id="_p0">
<value>.load.IsCSVReadyFileExists fire 0</value>
</process>
<process id="_p1">
<value>.load.StopOnFalse fire 0</value>
</process>
<process id="_p2">
<value>.load.ReadCSVReadyFile fire 0</value>
</process>
<process id="_p3">
<value>.load.IsMatchCSVFileTables fire 0</value>
</process>
<process id="_p4">
<value>.load.StopOnFalse2 fire 0</value>
</process>
<process id="_p5">
<value>.load.CreateEmptyLoadDB fire 0</value>
</process>
<process id="_p6">
<value>.load.Array Permute fire 0</value>
</process>
<process id="_p8">
<value>.load.ForEach.in fire 0</value>
</process>
<process id="_p27">
<value>.load.ForEach.CompositeActor.in fire 3</value>
</process>
<process id="_p28">
<value>.load.ForEach.CompositeActor.Record Disassembler fire 1</value>
</process>
<process id="_p29">
<value>.load.ForEach.CompositeActor.IsExistsCSVFile fire 1</value>
</process>
<process id="_p30">
<value>.load.ForEach.CompositeActor.StopOnFalse fire 1</value>
</process>
<process id="_p31">
<value>.load.ForEach.CompositeActor.ReadCSVFileColumnNames fire 1</value>
</process>
Optional Query 1
The workflow halts due to failing an IsMatchTableColumnRanges check. How many tables successfully loaded before the workflow halted due to a failed check?
(: count how many IsMatchTableColumnRangesOutput artifacts were generated, i.e., how many range checks ran :)
let $num := count(for $wgbs in $graph/causalDependencies/wasGeneratedBy
where $wgbs/role/@value = "IsMatchTableColumnRangesOutput"
return $wgbs)
(: the last check is the one that failed and halted the workflow, so n - 1 tables were loaded :)
return $num - 1
Output:
2
Optional Query 3
A CSV or header file is deleted during the workflow's execution. How much time expired between a successful IsMatchCSVFileTables test (when the file existed) and an unsuccessful IsExistsCSVFile test (when the file had been deleted)?
(: find the wasGeneratedBy of the false output from IsExistsCSVFile :)
let $fail := opmLib:getWasGeneratedBy($graph, "IsExistsCSVFile", "false")/time
(: find the wasGeneratedBy of the true output from IsMatchCSVFileTables :)
let $ok := opmLib:getWasGeneratedBy($graph, "IsMatchCSVFileTables", "true")/time
(: return elapsed seconds :)
let $diff := xs:time($fail/noLaterThan) - xs:time($ok/noEarlierThan)
return $diff div xs:dayTimeDuration('PT1S')
Output:
1.562
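The duration arithmetic in this query can be tried in isolation. The sketch below uses made-up timestamps, not values from our provenance file, and assumes the time elements can be read as xs:dateTime values. Subtracting two xs:dateTime values yields an xs:dayTimeDuration, and dividing that by a one-second duration yields a decimal number of seconds.

```xquery
xquery version "1.0";

(: Hypothetical timestamps, chosen to straddle midnight :)
let $ok   := xs:dateTime("2009-03-30T23:59:59.500Z")
let $fail := xs:dateTime("2009-03-31T00:00:02.250Z")
(: dateTime - dateTime gives xs:dayTimeDuration;
   duration div duration gives xs:decimal :)
return ($fail - $ok) div xs:dayTimeDuration("PT1S")
```

For these two instants the result is 2.75. Subtracting the same clock readings as xs:time values would give a negative duration, because xs:time carries no date. The xs:time form is therefore only safe when both events fall on the same day.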
Optional Query 6
Determine the step where the halt occurred.
(: get the last used or wasGeneratedBy relation :)
let $last := $graph/causalDependencies/(used|wasGeneratedBy)[last()]
let $processId := if(name($last) = "used") then $last/effect/@id else $last/cause/@id
return $graph/processes/process[@id=$processId]
Output:
<process id="_p13">
<value>.load-for-opt-query3.ForEach.CompositeActor.StopOnFalse fire 0</value>
</process>
Optional Query 8
Which steps were completed successfully before the halt occurred?
(: get the last used or wasGeneratedBy relation :)
let $last := $graph/causalDependencies/(used|wasGeneratedBy)[last()]
let $artifactId := if(name($last) = "used") then $last/cause/@id else $last/effect/@id
return opmLib:getAllAncestorProcesses($graph, $artifactId)
Output:
<process id="_p0">
<value>.load-for-opt-query3.IsCSVReadyFileExists fire 0</value>
</process>
<process id="_p1">
<value>.load-for-opt-query3.StopOnFalse fire 0</value>
</process>
<process id="_p2">
<value>.load-for-opt-query3.ReadCSVReadyFile fire 0</value>
</process>
<process id="_p3">
<value>.load-for-opt-query3.IsMatchCSVFileTables fire 0</value>
</process>
<process id="_p4">
<value>.load-for-opt-query3.StopOnFalse2 fire 0</value>
</process>
<process id="_p5">
<value>.load-for-opt-query3.CreateEmptyLoadDB fire 0</value>
</process>
<process id="_p6">
<value>.load-for-opt-query3.Array Permute fire 0</value>
</process>
<process id="_p8">
<value>.load-for-opt-query3.ForEach.in fire 0</value>
</process>
<process id="_p10">
<value>.load-for-opt-query3.ForEach.CompositeActor.in fire 1</value>
</process>
<process id="_p11">
<value>.load-for-opt-query3.ForEach.CompositeActor.Record Disassembler fire 0</value>
</process>
<process id="_p12">
<value>.load-for-opt-query3.ForEach.CompositeActor.IsExistsCSVFileFail fire 0</value>
</process>
Optional Query 10
For a workflow execution, determine the user inputs.
(: find all artifacts in a used relation, but not in a wasGeneratedBy relation :)
let $used := $graph/causalDependencies/used/cause/@id
let $wasGeneratedBy := $graph/causalDependencies/wasGeneratedBy/effect/@id
(: find the difference :)
let $diff := distinct-values($used[not(.=$wasGeneratedBy)])
(: return the artifacts :)
return $graph/artifacts/artifact[@id=$diff]
Output:
<artifact id="0">
<value>"pc3/workflows/data/J062941"</value>
</artifact>
<artifact id="6">
<value>"J062941"</value>
</artifact>
<artifact id="8">
<value>"Record"</value>
</artifact>
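The set difference in this query relies on XQuery's general comparison: `. = $seq` is true when the context item equals any item in `$seq`, so `$used[not(. = $wasGeneratedBy)]` keeps only ids that never appear as a generated artifact. A minimal illustration with hypothetical id lists:

```xquery
xquery version "1.0";

(: Hypothetical id lists standing in for used/cause/@id and wasGeneratedBy/effect/@id :)
let $used := ("0", "1", "6", "1", "8")
let $generated := ("1", "2")
(: the predicate tests each used id on its own; without the predicate,
   not($used = $generated) would yield a single boolean for the whole list :)
return distinct-values($used[not(. = $generated)])
```

This yields the three ids "0", "6" and "8". The order of items returned by distinct-values is implementation-dependent.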
Optional Query 11
For a workflow execution, determine the steps that required user inputs.
(: get artifacts ids of user inputs (from optional query 10) :)
let $used := $graph/causalDependencies/used/cause/@id
let $wasGeneratedBy := $graph/causalDependencies/wasGeneratedBy/effect/@id
let $diff := distinct-values($used[not(.=$wasGeneratedBy)])
(: find processes that directly used these artifacts :)
return opmLib:getImmediateUsedProcessesForArtifactId($graph, $diff)
Output:
<process id="_p0">
<value>.load.IsCSVReadyFileExists fire 0</value>
</process>
<process id="_p2">
<value>.load.ReadCSVReadyFile fire 0</value>
</process>
<process id="_p5">
<value>.load.CreateEmptyLoadDB fire 0</value>
</process>
<process id="_p6">
<value>.load.Array Permute fire 0</value>
</process>
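The lookup performed by getImmediateUsedProcessesForArtifactId is not shown on this page. A minimal sketch of what such a helper might do follows. The `local:` function name, the sample graph, and its ids are hypothetical, and the element layout is inferred from the queries above.

```xquery
xquery version "1.0";

(: Hypothetical stand-in for opmLib:getImmediateUsedProcessesForArtifactId. :)
declare function local:getImmediateUsedProcesses(
    $graph as element(), $artifactIds as xs:string*) as element(process)*
{
  (: a used edge links a process (effect) to an artifact it read (cause) :)
  let $procIds := $graph/causalDependencies/used[cause/@id = $artifactIds]/effect/@id
  return $graph/processes/process[@id = $procIds]
};

(: Made-up graph: a0 -> p0 -> a1 -> p1 :)
declare variable $graph :=
  <opmGraph>
    <processes>
      <process id="p0"><value>CheckReady</value></process>
      <process id="p1"><value>Load</value></process>
    </processes>
    <causalDependencies>
      <used><effect id="p0"/><role value="in"/><cause id="a0"/></used>
      <wasGeneratedBy><effect id="a1"/><role value="out"/><cause id="p0"/></wasGeneratedBy>
      <used><effect id="p1"/><role value="in"/><cause id="a1"/></used>
    </causalDependencies>
  </opmGraph>;

(: a0 is never generated, so it counts as a user input; only p0 reads it :)
local:getImmediateUsedProcesses($graph, "a0")
```

On this sample graph the call should return only process p0, because p1 reads a1, which was generated inside the workflow.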
Suggested Workflow Variants
Suggested Queries
Suggestions for Modification of the Open Provenance Model
Conclusions
--
DanielCrawl - 31 Mar 2009