Provenance Challenge: NCSA
Participating Team
Team and Project Details
- Short team name: NCSA
- Participant names: Joe Futrelle, Jim Myers, Robert Clark
- Project URL: http://leovip217.ncsa.uiuc.edu/
- Project Overview:
- Relevant Publications:
Workflow Representation
We chose to implement the workflow by means of the given Java code. This decision made it straightforward to use Tupelo, a semantic content repository framework being developed at NCSA, to record provenance. In addition to a Java OPM binding, Tupelo provides a dedicated provenance API, which allowed us to weave provenance recording directly into the workflow code. The provenance API records OPM-compliant records while abstracting away from the user the means by which these observations are stored.
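Our actual, fully commented recording code is linked in the section below. As a rough sketch only, and at the level of raw triples rather than the provenance API itself, the following shows how a single "was generated by" observation could be written into a Tupelo context, using the same reified arc shape that our Core Query 3 code later matches against; the artifact and arc names here are illustrative.
// Hedged sketch: record that an artifact was generated by a process.
// The artifact and arc names are illustrative, not taken from LoadWorkflow.java.
Resource process = PC3Utilities.ns("ReadCSVReadyFileProcess");
Resource artifact = PC3Utilities.ns("ReadCSVReadyFileOutputArtifact"); // illustrative
Resource arc = PC3Utilities.ns("ReadCSVReadyFileGeneratedArc"); // illustrative
// The generation arc is reified so that the arc itself can carry annotations,
// such as the observed-time interval used in Optional Query 3 below.
context.addTriple(arc, Opm.GENERATED_BY_PROCESS, process);
context.addTriple(arc, Opm.GENERATED_ARTIFACT, artifact);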
Approach to Recording Provenance
To make sense of the OPM output and graphs in the following sections, we first describe our OPM naming scheme and our approach to recording provenance records for this challenge.
Recording Guidelines
The following rules guided what we recorded.
- All of the code to record provenance information is contained in LoadWorkflow.java.
- Each new object created in LoadWorkflow.java is treated as a provenance artifact until that object's state is modified; the same object, now in a new state, is then treated as a new artifact.
- Each conditional, loop iteration, and method call in LoadWorkflow.java is treated as a process.
- We say that a successful conditional triggers the next process in the workflow (e.g. IsCSVReadyFileExistsConditional triggers ReadCSVReadyFileProcess); a sketch of how such a trigger might be recorded appears after this list.
- We also say that each loop iteration process triggers the first process inside the body of the loop (e.g. LoopIteration0 triggers IsExistsCSVFile0).
- The first iterator process is triggered by the CreateEmptyLoadDB process, since after returning from CreateEmptyLoadDB we enter the loop structure.
- If the conditional within the for-each is false (i.e. the loop ends), that iterator process triggers the CompactDatabase process. In this workflow's case, Iterator3 triggers CompactDatabase, since the for-each loop runs three times.
- We say that the last process in the body of the loop (IsMatchTableColumnRanges) triggers the next iterator process (e.g. IsMatchTableColumnRangesProcess0 triggers Iterator1).
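As referenced in the list above, here is a minimal sketch of how one such trigger might be asserted, assuming triggers are stored as reified OPM arcs (the shape our Core Query 3 Transformer reads back); the arc name is illustrative.
// Hedged sketch: assert that a successful conditional triggered the next process.
Resource conditional = PC3Utilities.ns("IsCSVReadyFileExistsConditional");
Resource next = PC3Utilities.ns("ReadCSVReadyFileProcess");
Resource trigger = PC3Utilities.ns("ReadCSVReadyFileTrigger"); // illustrative name
// The triggered process causally depends on the process that triggered it.
context.addTriple(trigger, Opm.TRIGGERED_PROCESS, next);
context.addTriple(trigger, Opm.TRIGGERED_BY_PROCESS, conditional);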
Naming Scheme
- A process entity representing a method call is given the same name as the method call with Process appended as a suffix; e.g. a call to IsCSVReadyFileExists is represented by an RDF triple with subject IsCSVReadyFileExistsProcess, or an OPM XML element with id IsCSVReadyFileExistsProcess.
- An artifact entity representing an object generated by a method call or a user's input is given the same name as the object with Artifact appended; e.g. the user's input JobID string is represented by an RDF triple about subject JobIDArtifact, or an OPM XML element with id JobIDArtifact.
- Each iteration of the loop is represented by an RDF triple about subject IteratorN, where N is the number of completed iterations so far.
- If an entity is created within the loop, we use the same naming scheme as for entities outside the loop, but with the number of completed iterations appended to the end of the name; e.g. there are RDF triples about IsExistsCSVFileOutputArtifact0, IsExistsCSVFileOutputArtifact1 and IsExistsCSVFileOutputArtifact2, and OPM XML entities with the same ids.
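The PC3Utilities.ns helper that appears throughout this page mints a resource in the http://pc3# namespace seen in the query outputs below. A minimal sketch of what we assume the helper looks like (the class body is illustrative; uriRef is Tupelo's URI-reference factory):
// Illustrative sketch of the naming helper used throughout this page.
public class PC3Utilities {
    private static final String NAMESPACE = "http://pc3#";

    // e.g. ns("JobIDArtifact") yields the resource http://pc3#JobIDArtifact
    public static Resource ns(String localName) {
        return Resource.uriRef(NAMESPACE + localName);
    }
}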
A small piece of code, with extensive comments explaining the process we used when recording a single OPM event in the workflow execution, is available here. Additionally, a piece of code illustrating our approach to recording provenance about the loop structure of the workflow can be found here.
Open Provenance Model Output
The OPM output for a successful run of the workflow with each dataset is available in the table below. Additionally, we have developed a tool that takes OPM output generated by Tupelo and produces a GIF image of the OPM graph. This tool is similar to the opm2dot tool available in the OPM toolbox, except that its input is not formatted in the OPM XML schema. The graph produced by this tool for a successful run of the J062941 dataset can be found here.
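A minimal sketch of the idea behind the tool, assuming the reified arc shapes queried elsewhere on this page: read the used and wasGeneratedBy arcs out of the Tupelo context with a Unifier and emit a DOT graph, which Graphviz can then render to a GIF. Our actual tool handles the full OPM vocabulary; this sketch covers only two arc types.
// Hedged sketch: emit the used and wasGeneratedBy arcs as a DOT graph.
StringBuilder dot = new StringBuilder("digraph opm {\n");

// Each used arc points from the using process to the used artifact.
Unifier used = new Unifier();
used.setColumnNames("process", "artifact");
used.addPattern("u", Opm.USED_BY_PROCESS, "process");
used.addPattern("u", Opm.USED_ARTIFACT, "artifact");
context.perform(used);
for (Tuple<Resource> row : used.getResult()) {
    dot.append("  \"" + row.get(0) + "\" -> \"" + row.get(1) + "\" [label=\"used\"];\n");
}

// Each wasGeneratedBy arc points from the artifact to the generating process.
Unifier gen = new Unifier();
gen.setColumnNames("artifact", "process");
gen.addPattern("g", Opm.GENERATED_ARTIFACT, "artifact");
gen.addPattern("g", Opm.GENERATED_BY_PROCESS, "process");
context.perform(gen);
for (Tuple<Resource> row : gen.getResult()) {
    dot.append("  \"" + row.get(0) + "\" -> \"" + row.get(1) + "\" [label=\"wasGeneratedBy\"];\n");
}

dot.append("}\n");
System.out.println(dot); // render with: dot -Tgif -o graph.gif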
Query Results
Core Query 1
Our approach to this query is as follows. For a given detection, we can use tools provided by Tupelo to find all artifacts on which the detection causally depends that are of RDF type CSV_file and carry a PathToFile property. The following code accomplishes this.
// Find every CSV_file artifact together with its PathToFile property.
Unifier u = new Unifier();
u.setColumnNames("file", "path");
u.addPattern("file", Rdf.TYPE, PC3Utilities.ns("CSV_file"));
u.addPattern("file", PC3Utilities.ns("PathToFile"), "path");
context.perform(u);
for (Tuple<Resource> r : u.getResult()) {
    System.out.println(r);
}
This yields the following output.
[http://pc3#FileEntryArtifact1, C:\dev\workspace\PC3\SampleData/J062941\P2_J062941_B001_P2fits0_20081115_P2ImageMeta.csv]
[http://pc3#FileEntryArtifact2, C:\dev\workspace\PC3\SampleData/J062941\P2_J062941_B001_P2fits0_20081115_P2Detection.csv]
[http://pc3#FileEntryArtifact0, C:\dev\workspace\PC3\SampleData/J062941\P2_J062941_B001_P2fits0_20081115_P2FrameMeta.csv]
Core Query 2
To answer this query, every time an IsMatchTableColumnRanges check is performed on a particular table, we assert within the execution of the workflow that the IsMatchTableColumnRanges process has a "PerformedOn" predicate with the table name as its object. Then, to answer the query, we accept a string representing the table name in question and use Tupelo's Unifier tool to find the process that was "PerformedOn" that table.
// Look for any process recorded as having been "PerformedOn" the table named s.
boolean checkPerformed = false;
Unifier u = new Unifier();
u.setColumnNames("process");
u.addPattern("process", PC3Utilities.ns("PerformedOn"), Resource.literal(s));
context.perform(u);
if (u.getResult().iterator().hasNext()) {
    String process = u.getResult().iterator().next().get(0).toString();
    // Strip the namespace and trailing character; only the range-check
    // conditionals carry a "PerformedOn" property.
    assertTrue(process.substring(process.lastIndexOf("#") + 1, process.length() - 1)
        .equals("IsMatchTableColumnRangesConditionals"));
    checkPerformed = true;
}
if (checkPerformed) {
    System.out.println("IsMatchTableColumnRanges was performed on table " + s);
}
Answering the query in this way allows us to say not only that the check was performed on the table, but that the check was successful. The assertion should always hold, since we recorded provenance such that any process with a "PerformedOn" property is among the IsMatchTableColumnRangesConditionals processes.
Core Query 3
We took this query to mean: given an image table (an artifact), on which processes does this table causally depend?
With this understanding, we answered the query as follows. We used a Tupelo tool called a Transformer to query the graph and assert new relations on the results. In particular, we created a new relation, "InferredTrigger", which is not part of the provenance API or the workflow execution.
Transformer t;

// Turn each reified "triggered by" arc into a direct InferredTrigger
// arc from the triggered process to the triggering process.
t = new Transformer();
t.addInPattern("t", Opm.TRIGGERED_PROCESS, "p1");
t.addInPattern("t", Opm.TRIGGERED_BY_PROCESS, "p2");
t.addOutPattern("p1", PC3Utilities.ns("InferredTrigger"), "p2");
context.perform(t);
context.addTriples(t.getResult());

// If p2 generated an artifact that p1 used, then p1 causally depends
// on p2; record that as an InferredTrigger arc as well.
t = new Transformer();
t.addInPattern("g", Opm.GENERATED_BY_PROCESS, "p2");
t.addInPattern("g", Opm.GENERATED_ARTIFACT, "a");
t.addInPattern("u", Opm.USED_ARTIFACT, "a");
t.addInPattern("u", Opm.USED_BY_PROCESS, "p1");
t.addOutPattern("p1", PC3Utilities.ns("InferredTrigger"), "p2");
context.perform(t);
context.addTriples(t.getResult());
This InferredTrigger relation is an arc from one process to a second process on which the first causally depends. Once that relation was constructed, we used another Tupelo tool called TransitiveClosure, which requires two parameters: an entity called the seed, and the relation on which to close. We closed the InferredTrigger relation with LoadCSVFileIntoTableProcess as the seed.
// Compute everything reachable from the seed over InferredTrigger arcs.
TransitiveClosure tc;
tc = new TransitiveClosure();
tc.setSeed(PC3Utilities.ns("LoadCSVFileIntoTableProcess1"));
tc.setPredicate(PC3Utilities.ns("InferredTrigger"));
Set<Resource> result = tc.close(context);

// Materialize the closure as direct InferredTrigger arcs from the seed.
for (Resource r : result) {
    context.addTriple(PC3Utilities.ns("LoadCSVFileIntoTableProcess1"),
        PC3Utilities.ns("InferredTrigger"), r);
}
Finally, we returned all the processes related to LoadCSVFileIntoTableProcess by an InferredTrigger arc.
// List every process on which the load causally depends.
Unifier u = new Unifier();
u.setColumnNames("p");
u.addPattern(PC3Utilities.ns("LoadCSVFileIntoTableProcess1"),
    PC3Utilities.ns("InferredTrigger"), "p");
context.perform(u);
System.out.println("Suggested Query Five Results:");
for (Tuple<Resource> r : u.getResult()) {
    System.out.println("\t" + r);
}
These processes are the answer to the query.
This is our output:
[http://pc3#IsCSVReadyFileExistsProcess]
[http://pc3#LoadCSVFileIntoTableConditional0]
[http://pc3#IsExistsCSVFileConditional1]
[http://pc3#LoadCSVFileIntoTableProcess0]
[http://pc3#IsExistsCSVFileProcess1]
[http://pc3#UpdateComputedColumnsConditional0]
[http://pc3#Iterator1]
[http://pc3#ReadCSVFileColumnNamesProcess0]
[http://pc3#IsMatchCSVFileTablesConditional]
[http://pc3#IsCSVReadyFileExistsConditional]
[http://pc3#UpdateComputedColumnsProcess0]
[http://pc3#IsMatchTableRowCountProcess0]
[http://pc3#IsExistsCSVFileProcess0]
[http://pc3#IsMatchCSVFileTablesProcess]
[http://pc3#ReadCSVFileColumnNamesProcess1]
[http://pc3#IsMatchCSVFileColumnNamesProcess1]
[http://pc3#IsMatchTableColumnRangesProcess0]
[http://pc3#IsMatchTableColumnRangesConditional0]
[http://pc3#IsMatchCSVFileColumnNamesProcess0]
[http://pc3#CreateEmptyLoadDBProcess]
[http://pc3#Iterator0]
[http://pc3#IsMatchCSVFileColumnNamesConditional0]
[http://pc3#ReadCSVReadyFileProcess]
[http://pc3#IsMatchCSVFileColumnNamesConditional1]
[http://pc3#IsExistsCSVFileConditional0]
[http://pc3#IsMatchTableRowCountConditional0]
Optional Query 1
We answered this query by first simulating an exception in the workflow execution and then recursively walking the graph, looking for the "dead end" iteration process where the exception was thrown. To accomplish this, we call _findIterators (below) with findOne set to true; the findOne flag tells _findIterators to stop once the first entity of RDF type loop_iteration has been found. We then call _findIterators again, this time with findOne set to false; once this returns, we have all entities of RDF type loop_iteration. Finally, the answer to the query is the result of the second call with the result of the first call removed.
// Walk the graph from node n, collecting every reachable node of RDF
// type loop_iteration. If findOne is set, stop after the first hit.
void _findIterators(GraphNode n, Set<Resource> beenThere, Set<Resource> result,
        boolean findOne) throws Exception {
    Resource subject = n.getSubject();
    if (beenThere.contains(subject) || (findOne && !result.isEmpty())) {
        return;
    }
    beenThere.add(subject);
    for (GraphEdge edge : n.getOutgoingEdges()) {
        if (session.fetchThing(edge.getSink().getSubject()).getTypes()
                .contains(PC3Utilities.ns("loop_iteration"))) {
            result.add(edge.getSink().getSubject());
        }
        _findIterators(edge.getSink(), beenThere, result, findOne);
    }
}
Note that one cannot simply select every entity of RDF type loop_iteration from the Tupelo context and take that count minus one as the answer to the query, since it is possible to create many entities of RDF type loop_iteration long before they are used in the provenance trail.
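For contrast, a minimal sketch of that naive count, which can overcount for exactly the reason just given:
// The naive approach warned against above: count every loop_iteration
// entity in the context and subtract one. Iterator entities may be
// recorded before they are reached in the provenance trail, so this
// can overcount; hence the graph walk in _findIterators.
Unifier u = new Unifier();
u.setColumnNames("iterator");
u.addPattern("iterator", Rdf.TYPE, PC3Utilities.ns("loop_iteration"));
context.perform(u);
int count = 0;
for (Tuple<Resource> row : u.getResult()) {
    count++;
}
System.out.println("Naive answer (may overcount): " + (count - 1));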
Optional Query 3
To answer this query, we used the time annotation support for provenance arcs provided by Tupelo. We first isolated two used arcs: the first represents the check that IsMatchCSVFileTablesOutput is true, and the second represents the failure of an IsExistsCSVFileConditional. Once we had those two arcs, answering the query was a matter of extracting the intervals of time within which both events were known to have occurred, and returning the upper and lower bounds on the time that may have elapsed between them.
ProvenanceContextFacade pcf = new ProvenanceContextFacade();
pcf.setContext(context);
ProvenanceArcSeeker s = new ProvenanceArcSeeker(pcf);

// The arc recording the successful IsMatchCSVFileTables check.
Collection<ProvenanceUsedArc> matchArcs = s.findUsedEdges(
    pcf.getArtifact(PC3Utilities.ns("IsMatchCSVFileTablesArtifact")),
    pcf.getProcess(PC3Utilities.ns("IsMatchCSVFileTablesConditional")),
    null, pcf.getAccount(PC3Utilities.ns("account")));
assertTrue(matchArcs.size() == 1);
ObservedTime matchTime = ((ProvenanceRoleArc) matchArcs.toArray()[0]).getInterval();

// The arc recording the failed IsExistsCSVFile check.
Collection<ProvenanceUsedArc> existsArcs = s.findUsedEdges(
    pcf.getArtifact(PC3Utilities.ns("IsExistsCSVFileArtifact1")),
    pcf.getProcess(PC3Utilities.ns("IsExistsCSVFileConditional1")),
    null, pcf.getAccount(PC3Utilities.ns("account")));
assertTrue(existsArcs.size() == 1);
ObservedTime existTime = ((ProvenanceRoleArc) existsArcs.toArray()[0]).getInterval();

// Each event is known only to within an interval, so the elapsed time
// is bounded below by the gap between the intervals (at least zero)
// and above by the span from the earliest to the latest endpoint.
long lowerBoundElapsed = existTime.getNoEarlier().getTime() - matchTime.getNoLater().getTime();
if (lowerBoundElapsed < 0) {
    lowerBoundElapsed = 0;
}
long upperBoundElapsed = existTime.getNoLater().getTime() - matchTime.getNoEarlier().getTime();
System.out.println("Somewhere between [" + lowerBoundElapsed + ", " + upperBoundElapsed
    + "] milliseconds elapsed between the successful IsMatchCSVFileTables"
    + " and the unsuccessful IsExistsCSVFile.");
On a particular run of the workflow and query, we generated the following output:
Somewhere between [9988, 10007] milliseconds elapsed between the successful IsMatchCSVFileTables and the unsuccessful IsExistsCSVFile.
Optional Query 9
In light of some of the discussion on the mailing list about this question, we answer the following question instead: suppose one run of the workflow halts and another runs to completion. Is an account contained in the halted graph contained in an account contained in the graph of the successful run?
To answer this, we wrote a utility that takes as parameters two graphs, two accounts, and two nodes, and returns true if the account contained in the first graph is contained in the account contained in the second. More precisely, beginning at the given nodes (one in each graph), we continue as long as the account-specific arcs of the current node in the first graph are a subset of the account-specific arcs of the corresponding node in the second. Here two arcs are equal if they have the same source, the same sink, and the same role (if applicable); two nodes are equal if they have the same name. To continue, we recursively follow each account-specific arc and verify that the set of account-specific arcs of each node in the first graph is contained in the set of account-specific arcs of the corresponding node in the second graph. If every node in the first graph has been visited and the arcs of each node satisfy the containment requirement, then the account contained in the first graph is a subgraph of the account contained in the second.
The code that accomplishes this can be found here.
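A minimal sketch of the containment walk described above, using the GraphNode and GraphEdge types that appear elsewhere on this page; the sameArc helper (same source and sink names, and the same role when one applies) is illustrative, and our actual utility additionally threads the two accounts through the recursion.
// Hedged sketch: does the account rooted at node a (the halted run)
// appear as a subgraph of the account rooted at node b (the successful run)?
boolean contained(GraphNode a, GraphNode b, Set<Resource> visited) {
    if (!visited.add(a.getSubject())) {
        return true; // this node of the first graph has already been checked
    }
    for (GraphEdge arcA : a.getOutgoingEdges()) {
        // Look for a matching account-specific arc on b.
        GraphEdge match = null;
        for (GraphEdge arcB : b.getOutgoingEdges()) {
            if (sameArc(arcA, arcB)) { // illustrative equality helper
                match = arcB;
                break;
            }
        }
        if (match == null) {
            return false; // an arc of the first account has no counterpart
        }
        // Recurse along the matched pair of arcs.
        if (!contained(arcA.getSink(), match.getSink(), visited)) {
            return false;
        }
    }
    return true; // every visited node satisfied the containment requirement
}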
As a consequence of what we record when the workflow halts, and as we expected, it turns out that the provenance account given by a halted run of the workflow is indeed contained in the provenance account of a successful run.
Optional Query 10
We have reevaluated our approach to this query because our initial answer (which we leave posted below for reference) did not take the open world assumption into consideration: by assuming that an artifact represents a user's input whenever it does not causally depend on anything within the given graph, we implicitly assume knowledge about that artifact that is not contained in the graph.
Our new approach to this query is as follows. We assert that a user's input is of RDF type User_Input and return only provenance elements of this type.
// Return every artifact explicitly typed as User_Input.
Set<Resource> result = new HashSet<Resource>();
Unifier u = new Unifier();
u.setColumnNames("artifact");
u.addPattern("artifact", Rdf.TYPE, PC3Utilities.ns("User_Input"));
context.perform(u);
for (Tuple<Resource> tuple : u.getResult()) {
    System.out.println(tuple);
    result.add(tuple.get(0));
}
return result;
Result:
http://pc3#CSVRootPathResource
http://pc3#JobIDresource
Our original approach to the query follows below.
To answer this query, we walk the graph down from the final entity produced by the workflow execution, in this case the DisplayPlotProcess, and keep any entities that do not causally depend on any other entity in the graph (i.e. those graph nodes with no outgoing edges).
// Walk the graph from node n, collecting every sink with no outgoing
// edges; such nodes do not causally depend on anything in the graph.
void findInputs(GraphNode n, Set<Resource> beenThere, Set<Resource> result)
        throws ModelException, IOException, OperatorException {
    Resource subject = n.getSubject();
    if (beenThere.contains(subject)) {
        return;
    }
    beenThere.add(subject);
    for (GraphEdge edge : n.getOutgoingEdges()) {
        if (edge.getSink().getOutgoingEdges().isEmpty()) {
            result.add(edge.getSink().getSubject());
        }
        findInputs(edge.getSink(), beenThere, result);
    }
}
Result:
http://pc3#CSVRootPathResource
http://pc3#JobIDresource
Optional Query 11
We approached this query by first answering Optional Query 10, then retrieving all used arcs terminating at each of the entities in that result.
ProvenanceContextFacade pcf = new ProvenanceContextFacade();
Set<Resource> result = new HashSet<Resource>();
pcf.setContext(context);

// For each user input from Optional Query 10, collect every process
// at the source of a used arc terminating at that input.
Collection<ProvenanceUsedArc> arcs = new HashSet<ProvenanceUsedArc>();
for (Resource input : userInputs) {
    arcs = pcf.getUsedBy(pcf.getArtifact(input));
    for (ProvenanceUsedArc arc : arcs) {
        result.add(((RdfProvenanceElement) arc.getProcess()).getSubject());
    }
}
for (Resource r : result) {
    System.out.println(r);
}
Result:
http://pc3#IsCSVReadyFileExistsProcess
http://pc3#CreateEmptyLoadDBProcess
http://pc3#ReadCSVReadyFileProcess
Suggested Workflow Variants
Suggested Queries
Suggestions for Modification of the Open Provenance Model
Conclusions
-- RobertClark - 11 May 2009
-- RobertClark - 13 Apr 2009
-- RobertClark - 04 Mar 2009