Provenance Challenge

NCSA Provenance Challenge, D2K

Participating Team

Workflow Representation

D2K - Data to Knowledge - is a rapid, flexible data mining and machine learning system that integrates analytical data mining methods for prediction, discovery, and deviation detection, with data and information visualization tools. [1] It offers a visual programming environment that allows users to connect programming modules together to build data mining applications and supplies a core set of modules, application templates, and a standard API for software component development. All D2K components are written in Java for maximum flexibility and portability.

The Provenance Challenge was implemented as a D2K itinerary that executes modules to call the application software for the problem. Each module collects the parameters and input names, constructs a command line, and calls the Java runtime to execute the application. A graphic showing the itinerary is attached.

To implement the Provenance Challenge we created a prototype version of the D2K infrastructure. The standard D2K infrastructure was modified to instrument the execution of itineraries. In the prototype, the infrastructure constructs RDF triples to describe the static dataflow graph (see graphic) and also to record the execution of each module. The modules and the itinerary were not modified and contain no special code to provide provenance. The instrumentation is completely generic: it produces this metadata for any itinerary.

1. http://alg.ncsa.uiuc.edu/do/tools/d2k

Provenance Trace

D2K generates a text file encoding RDF statements produced during the execution of an itinerary. The file is not in any standard RDF serialization. The file generated for the example workflow is attached.

Provenance Queries

#1. Find the process that led to Atlas X Graphic / everything that caused Atlas X Graphic to be as it is. This should tell us the new brain images from which the averaged atlas was generated, the warping performed etc.

To do this we need transitive closure on the property of one module having as input the output of another module, which we'll call "precedence". Kowari can only compute transitive closure per-predicate, so this needs to be collapsed into a single predicate as follows:

insert
select $this <http://d2k/precedes> $next
from <rmi://badger.ncsa.uiuc.edu/server1#d2kpc> where
$this <http://d2k/hasOutputPort> $op and 
$op <http://d2k/producesOutput> $of and
$next <http://d2k/hasInputPort> $ip and
$ip <http://d2k/hasInput> $of
into <rmi://badger.ncsa.uiuc.edu/server1#d2kpc>;
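The join performed by this insert can be illustrated outside Kowari. The following is a minimal Python sketch, assuming the triples are held in memory as (subject, predicate, object) tuples; the node names (align_warp_1, aw1_out, and so on) and the derive_precedes helper are hypothetical and do not come from the D2K trace.

```python
# Predicate URIs as used in the iTQL above.
HAS_OUTPUT_PORT = "http://d2k/hasOutputPort"
PRODUCES_OUTPUT = "http://d2k/producesOutput"
HAS_INPUT_PORT = "http://d2k/hasInputPort"
HAS_INPUT = "http://d2k/hasInput"
PRECEDES = "http://d2k/precedes"


def derive_precedes(triples):
    """Return (producer, PRECEDES, consumer) for every object that one
    module produces on an output port and another takes on an input port."""
    def pairs(pred):
        return [(s, o) for s, p, o in triples if p == pred]

    # Map each produced object to the set of modules that produced it.
    producers = {}
    for module, port in pairs(HAS_OUTPUT_PORT):
        for port2, obj in pairs(PRODUCES_OUTPUT):
            if port == port2:
                producers.setdefault(obj, set()).add(module)

    derived = set()
    for module, port in pairs(HAS_INPUT_PORT):
        for port2, obj in pairs(HAS_INPUT):
            if port == port2:
                for producer in producers.get(obj, ()):
                    derived.add((producer, PRECEDES, module))
    return derived


# Hypothetical two-module fragment of a trace.
triples = {
    ("align_warp_1", HAS_OUTPUT_PORT, "aw1_out"),
    ("aw1_out", PRODUCES_OUTPUT, "warp1"),
    ("reslice_1", HAS_INPUT_PORT, "rs1_in"),
    ("rs1_in", HAS_INPUT, "warp1"),
}
precedes = derive_precedes(triples)
```

As in the iTQL, two modules are linked only when the same node appears as both an output and an input.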

The following query starts with the module that output Atlas X Graphic and finds all preceding modules. Note that because of how D2K represents inputs and outputs, we can only match on pathnames, not on true file identity.

select $s
from <rmi://badger.ncsa.uiuc.edu/server1#d2kpc> where
($s <http://d2k/hasOutputPort> $op and
 $op <http://d2k/producesOutput> $out and
 $out <http://d2k/hasObjectValue> 'data/Provenance/atlas-x.gif')
or
($end <http://d2k/hasOutputPort> $op and
 $op <http://d2k/producesOutput> $out and
 $out <http://d2k/hasObjectValue> 'data/Provenance/atlas-x.gif' and
 ($s <http://d2k/precedes> $end
  or trans ($s <http://d2k/precedes> $end)));
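What the trans clause computes can be sketched as a backward graph walk. This Python example is illustrative only: it assumes the precedes statements have been exported as (source, destination) pairs, and the module names are made up rather than taken from the trace.

```python
from collections import defaultdict


def ancestors(precedes_edges, target):
    """All modules that precede `target`, directly or indirectly."""
    preds = defaultdict(set)
    for src, dst in precedes_edges:
        preds[dst].add(src)

    seen, stack = set(), [target]
    while stack:
        node = stack.pop()
        for pred in preds[node]:
            if pred not in seen:
                seen.add(pred)
                stack.append(pred)
    return seen


# Hypothetical, simplified precedence edges.
edges = [
    ("align_warp", "reslice"),
    ("reslice", "softmean"),
    ("softmean", "slicer_x"),
    ("slicer_x", "convert_x"),
    ("softmean", "slicer_y"),
    ("slicer_y", "convert_y"),
]

# The query returns the end module itself plus everything that precedes it.
process = ancestors(edges, "convert_x") | {"convert_x"}
```

The union with the end module mirrors the first branch of the iTQL "or", which matches the module that produced the graphic itself.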

To describe the process, we can return all triples on those modules:

select $m $p $o
from <rmi://badger.ncsa.uiuc.edu/server1#d2kpc> where
(($m <http://d2k/hasOutputPort> $op and
  $op <http://d2k/producesOutput> $out and
  $out <http://d2k/hasObjectValue> 'data/Provenance/atlas-x.gif')
 or
 ($end <http://d2k/hasOutputPort> $op and
  $op <http://d2k/producesOutput> $out and
  $out <http://d2k/hasObjectValue> 'data/Provenance/atlas-x.gif' and
  ($m <http://d2k/precedes> $end
   or trans ($m <http://d2k/precedes> $end))))
and $m $p $o;

This is informative, but it's more informative when property key/value pairs are also returned. Here we exploit the fact that properties are not shared between modules to simplify our query.

select $s $p $o
from <rmi://badger.ncsa.uiuc.edu/server1#d2kpc> where
(($m <http://d2k/hasOutputPort> $op and
  $op <http://d2k/producesOutput> $out and
  $out <http://d2k/hasObjectValue> 'data/Provenance/atlas-x.gif')
 or
 ($end <http://d2k/hasOutputPort> $op and
  $op <http://d2k/producesOutput> $out and
  $out <http://d2k/hasObjectValue> 'data/Provenance/atlas-x.gif' and
  ($m <http://d2k/precedes> $end
   or trans ($m <http://d2k/precedes> $end))))
and
(($m <http://d2k/hasProperty> $prop and
  $s <http://d2k/hasProperty> $prop and
  $s $p $o) or
  ($m <http://d2k/hasProperty> $s and
   $s $p $o));

#2. Find the process that led to Atlas X Graphic, excluding everything prior to the averaging of images with softmean.

This query returns the id of the module that executes softmean, as well as all following modules:

select $s
from <rmi://badger.ncsa.uiuc.edu/server1#d2kpc> where
$softmean <http://d2k/moduleClass> 'ncsa.d2k.modules.core.util.ExecSoftMean' and
$avantSoftmean <http://d2k/precedes> $softmean
and
($avantSoftmean <http://d2k/precedes> $s
 or trans ($avantSoftmean <http://d2k/precedes> $s));

Now we can constrain it as in query #1, to capture which of those modules contributed to Atlas X Graphic:

select $s
from <rmi://badger.ncsa.uiuc.edu/server1#d2kpc> where
$softmean <http://d2k/moduleClass> 'ncsa.d2k.modules.core.util.ExecSoftMean' and
$avantSoftmean <http://d2k/precedes> $softmean
and
($avantSoftmean <http://d2k/precedes> $s
 or trans ($avantSoftmean <http://d2k/precedes> $s))
and (
($s <http://d2k/hasOutputPort> $op and
 $op <http://d2k/producesOutput> $out and
 $out <http://d2k/hasObjectValue> 'data/Provenance/atlas-x.gif')
or
($end <http://d2k/hasOutputPort> $op and
 $op <http://d2k/producesOutput> $out and
 $out <http://d2k/hasObjectValue> 'data/Provenance/atlas-x.gif' and
 ($s <http://d2k/precedes> $end
  or trans ($s <http://d2k/precedes> $end)))
);
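The combined constraint amounts to intersecting two reachability sets: the modules from softmean onward and the modules leading to Atlas X Graphic. The Python sketch below shows that idea under simplifying assumptions: the edge list and module names are hypothetical, and it starts from softmean directly instead of going through a predecessor as the iTQL does.

```python
def reachable(edges, start, forward=True):
    """Nodes reachable from `start` (excluding `start` itself), following
    edges forward (descendants) or backward (ancestors)."""
    adjacency = {}
    for src, dst in edges:
        a, b = (src, dst) if forward else (dst, src)
        adjacency.setdefault(a, set()).add(b)

    seen, stack = set(), [start]
    while stack:
        for nxt in adjacency.get(stack.pop(), ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


# Hypothetical, simplified precedence edges.
edges = [
    ("align_warp", "reslice"),
    ("reslice", "softmean"),
    ("softmean", "slicer_x"),
    ("slicer_x", "convert_x"),
    ("softmean", "slicer_y"),
    ("slicer_y", "convert_y"),
]

from_softmean = reachable(edges, "softmean") | {"softmean"}
to_atlas_x = reachable(edges, "convert_x", forward=False) | {"convert_x"}
result = from_softmean & to_atlas_x
```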

Now we can get all triples on these modules and their properties, as in #1:

select $s $p $o
from <rmi://badger.ncsa.uiuc.edu/server1#d2kpc> where
$softmean <http://d2k/moduleClass> 'ncsa.d2k.modules.core.util.ExecSoftMean' and
$avantSoftmean <http://d2k/precedes> $softmean
and
($avantSoftmean <http://d2k/precedes> $m or
 trans ($avantSoftmean <http://d2k/precedes> $m))
and (
($m <http://d2k/hasOutputPort> $op and
 $op <http://d2k/producesOutput> $out and
 $out <http://d2k/hasObjectValue> 'data/Provenance/atlas-x.gif')
or
($end <http://d2k/hasOutputPort> $op and
 $op <http://d2k/producesOutput> $out and
 $out <http://d2k/hasObjectValue> 'data/Provenance/atlas-x.gif' and
 ($m <http://d2k/precedes> $end
  or trans ($m <http://d2k/precedes> $end)))
)
and
(($m <http://d2k/hasProperty> $prop and
  $s <http://d2k/hasProperty> $prop and
  $s $p $o) or
  ($m <http://d2k/hasProperty> $s and
   $s $p $o));

#3. Find the Stage 3, 4 and 5 details of the process that led to Atlas X Graphic.

D2K doesn't have a concept of workflow "stages," so our knowledge of the strategy the author used is external to D2K and we need to add that information as annotations. We can characterize the stages as follows: stage 3 is the softmean stage, stage 4 is the slicer stage, and stage 5 is the convert stage. In the D2K workflow, there is a module between the softmean stage and the slicer stage whose purpose is to convert the single output of the softmean stage into three outputs that serve as inputs to the slicers. For the purposes of the challenge, we'll characterize this as part of stage 3.

The following query adds a predicate to all the stage 3, 4, and 5 modules describing which stage they're in. The query keys on the module class, because in the example workflow module class is sufficient to identify the modules. This is of course not true in the general case.

insert
select $mod <http://d2k/pc/inStage> $stage
from <rmi://badger.ncsa.uiuc.edu/server1#d2kpc> where
($mod <http://d2k/moduleClass> 'ncsa.d2k.modules.core.util.ExecSoftMean' and
 $stage <http://tucana.org/tucana#is> '3') or
($mod <http://d2k/moduleClass> 'ncsa.d2k.core.modules.FanOutModule' and
 $stage <http://tucana.org/tucana#is> '3') or
($mod <http://d2k/moduleClass> 'ncsa.d2k.modules.core.util.ExecSlicer' and
 $stage <http://tucana.org/tucana#is> '4') or
($mod <http://d2k/moduleClass> 'ncsa.d2k.modules.core.util.ExecConvert' and
 $stage <http://tucana.org/tucana#is> '5')
into <rmi://badger.ncsa.uiuc.edu/server1#d2kpc>;

This query retrieves the ids of all the modules in stages 3, 4, and 5, and the statements and properties associated with them:

select $s $p $o
from <rmi://badger.ncsa.uiuc.edu/server1#d2kpc> where
($m <http://d2k/pc/inStage> '3' or
 $m <http://d2k/pc/inStage> '4' or
 $m <http://d2k/pc/inStage> '5')
and
(($m <http://d2k/hasProperty> $prop and
  $s <http://d2k/hasProperty> $prop and
  $s $p $o) or
  ($m <http://d2k/hasProperty> $s and
   $s $p $o));

#4. Find all invocations of procedure align_warp using a twelfth order nonlinear 1365 parameter model (see model menu describing possible values of parameter "-m 12" of align_warp) that ran on a Monday.

We can't quite answer this query from the execution trace data, because there the command line arguments are recorded together as a single string rather than as separate values. Our technique would be no different if they had been split up, so for the purposes of the challenge we will search for "-m 12 -q" instead of "-m 12".

iTQL doesn't support date arithmetic, so here we'll just match the align_warp invocations and return timestamps along with them:

select $aw $start $end
from <rmi://badger.ncsa.uiuc.edu/server1#d2kpc> where
$aw <http://d2k/moduleClass> 'ncsa.d2k.modules.core.util.ExecWarpAlign' and
$aw <http://d2k/startTime> $start and
$aw <http://d2k/endTime> $end and
$aw <http://d2k/hasProperty> $param and
$param <http://d2k/hasPropertyName> 'arguments' and
$param <http://d2k/hasPropertyValue> '-m 12 -q';
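The Monday test can then be applied to the returned rows on the client side. This Python sketch assumes the timestamps are ISO 8601 strings, which may not match the format D2K actually records; the rows and the ran_on_monday helper are hypothetical.

```python
from datetime import datetime


def ran_on_monday(rows):
    """Keep (module, start, end) rows whose start time falls on a Monday.

    Assumes ISO 8601 timestamps; the real trace format may differ.
    """
    return [
        row for row in rows
        if datetime.fromisoformat(row[1]).weekday() == 0  # Monday is 0
    ]


# Hypothetical query results: 11 Sep 2006 was a Monday, 12 Sep a Tuesday.
rows = [
    ("align_warp_1", "2006-09-11T10:15:00", "2006-09-11T10:15:04"),
    ("align_warp_2", "2006-09-12T09:00:00", "2006-09-12T09:00:03"),
]
mondays = ran_on_monday(rows)
```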

#5. Find all Atlas Graphic images outputted from workflows where at least one of the input Anatomy Headers had an entry global maximum=4095. The contents of a header file can be extracted as text using the scanheader AIR utility.

The workflow execution trace doesn't contain any information about header files (align_warp does not take them as arguments), nor does the workflow run scanheader (that's not part of the example workflow, so we didn't add it to our workflow). However we can infer the existence of a header file, extract the values if we have the file data handy, and add nodes to the execution trace containing header keys and values.

A problem with this is that the D2K workflow does not identify files. Rather, it identifies inputs and outputs, and the inputs in this workflow have pathnames as values. But a pathname does not identify a file; rather, it identifies a location that could hold different data at different times. We happen to know in this case that the files do not change once they exist. But there's no good place for us to put the header information in the graph. Should it be attached to the filename itself, or to the inputs? In this workflow there are multiple inputs that refer to the same file, but it makes the most sense to associate information about the file with the input. The result will be redundant information in the graph (the same information will be represented more than once if a file is used as an input for more than one module), and the solution would be to have D2K represent the files as non-terminal nodes in its execution trace instead of pathnames attached to inputs.

This query will get us the pathnames that are associated with inputs to align_warp:

select $in $file
from <rmi://badger.ncsa.uiuc.edu/server1#d2kpc> where
$module <http://d2k/moduleClass> 'ncsa.d2k.modules.core.util.ExecWarpAlign' and
$module <http://d2k/hasInputPort> $ip and
$ip <http://d2k/hasInput> $in and
$in <http://d2k/hasObjectValue> $file;

Given the output of this query, we can scan the header files and produce RDF describing them with the following Perl script. Note that the script has to know a priori how to map local paths to the ones used in the workflow description. A better solution would be if each dataset had a globally unique id independent of where it is physically stored.

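#!/usr/bin/perl
# Reads tab-separated "input<TAB>pathname" rows (the output of the query
# above) and prints iTQL-style triples describing each image's header.
use strict;
use warnings;

my $AIR_BIN = "AIR/bin";
my $LOCAL_DATA_DIR = "data";
my $WORKFLOW_DATA_DIR = "/home/mcgrath/D2Kb/data/Provenance";
my $ix = 10;    # counter used to mint header node URIs
while (my $line = <>) {
    chomp $line;    # keep the trailing newline out of the command
    my ($input, $workflowFile) = split /\t/, $line;
    # Map the pathname used in the workflow to the local copy of the file.
    (my $localFile = $workflowFile) =~ s/^\Q$WORKFLOW_DATA_DIR\E/$LOCAL_DATA_DIR/;
    # Corresponding header file; scanheader itself is given the image path.
    (my $headerFile = $localFile) =~ s/\.img$/.hdr/;
    open my $scan, "-|", "$AIR_BIN/scanheader", $localFile
        or die "cannot run scanheader on $localFile: $!";
    while (<$scan>) {
        chomp;
        next if /^$/;
        # Split on the first "=" only, so values may themselves contain "=".
        my ($name, $value) = split /=/, $_, 2;
        my $header = "http://d2k/pc/header${ix}";
        print "<${input}> <http://d2k/pc/hasHeader> <${header}>\n";
        print "<${header}> <http://d2k/pc/hasHeaderName> '${name}'\n";
        print "<${header}> <http://d2k/pc/hasHeaderValue> '${value}'\n";
        $ix++;
    }
    close $scan;
}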

The script's output, which can be inserted directly into Kowari, is attached as pc_d2k_headers.txt (note that it is formatted in iTQL and not a standard RDF serialization).

Now we can do the query ("Find all Atlas Graphic images outputted from workflows where at least one of the input Anatomy Headers had an entry global maximum=4095"). We know that align_warp takes anatomy images as inputs, so we can look at those inputs to see if they have matching header values, and find the associated modules:

select $module
from <rmi://badger.ncsa.uiuc.edu/server1#d2kpc> where
$header <http://d2k/pc/hasHeaderName> 'global maximum' and
$header <http://d2k/pc/hasHeaderValue> '4095' and
$in <http://d2k/pc/hasHeader> $header and
$ip <http://d2k/hasInput> $in and
$module <http://d2k/hasInputPort> $ip and
$module <http://d2k/moduleClass> 'ncsa.d2k.modules.core.util.ExecWarpAlign';

Now we need to find all the atlas graphic images resulting from any of these modules. We walk from the modules with the files-of-interest as inputs until we hit a "convert" step, which we know has an atlas graphic as an output:

select $out $pathname
from <rmi://badger.ncsa.uiuc.edu/server1#d2kpc> where
$header <http://d2k/pc/hasHeaderName> 'global maximum' and
$header <http://d2k/pc/hasHeaderValue> '4095' and
$in <http://d2k/pc/hasHeader> $header and
$ip <http://d2k/hasInput> $in and
$module <http://d2k/hasInputPort> $ip and
$module <http://d2k/moduleClass> 'ncsa.d2k.modules.core.util.ExecWarpAlign' and
trans ($module <http://d2k/precedes> $end) and
$end <http://d2k/moduleClass> 'ncsa.d2k.modules.core.util.ExecConvert' and
$end <http://d2k/hasOutputPort> $op and
$op <http://d2k/producesOutput> $out and
$out <http://d2k/hasObjectValue> $pathname;

(which returns all three atlas graphic images).

#6. Find all output averaged images of softmean (average) procedures, where the warped images taken as input were align_warped using a twelfth order nonlinear 1365 parameter model, i.e. "where softmean was preceded in the workflow, directly or indirectly, by an align_warp procedure with argument -m 12."

We can answer this query by combining the conditions in the query with a traversal of the transitive closure of the precedence predicate (see #1).

select $out $img
from <rmi://badger.ncsa.uiuc.edu/server1#d2kpc> where
$softmean <http://d2k/hasOutputPort> $op and
$op <http://d2k/producesOutput> $out and
$out <http://d2k/hasObjectValue> $img and
$softmean <http://d2k/moduleClass> 'ncsa.d2k.modules.core.util.ExecSoftMean' and
trans ($alignWarp <http://d2k/precedes> $softmean) and
$alignWarp <http://d2k/moduleClass> 'ncsa.d2k.modules.core.util.ExecWarpAlign' and
$alignWarp <http://d2k/hasProperty> $param and
$param <http://d2k/hasPropertyName> 'arguments' and
$param <http://d2k/hasPropertyValue> '-m 12 -q';

#7. A user has run the workflow twice, in the second instance replacing each procedure (convert) in the final stage with two procedures: pgmtoppm, then pnmtojpeg. Find the differences between the two workflow runs. The exact level of detail in the difference that is detected by a system is up to each participant.

It is not clear to us what this query is asking for. Graph diffs could be computed over the compute nodes and input/output edges, statistics could profile the distribution of execution times across the runs, parameters could be compared, and so on.
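One of these options, a diff over precedence edges, could look like the following Python sketch. It is not something we ran for the challenge. It assumes each run has been reduced to a set of (producer, consumer) pairs keyed on tool name rather than module id, since ids would differ between runs; the edge sets and the diff_runs helper are hypothetical.

```python
def diff_runs(run_a, run_b):
    """Return (edges only in run_a, edges only in run_b)."""
    a, b = set(run_a), set(run_b)
    return a - b, b - a


# Hypothetical, simplified edge sets for the two runs described in the query.
run1 = {("softmean", "slicer"), ("slicer", "convert")}
run2 = {
    ("softmean", "slicer"),
    ("slicer", "pgmtoppm"),
    ("pgmtoppm", "pnmtojpeg"),
}
removed, added = diff_runs(run1, run2)
```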

#8. A user has annotated some anatomy images with a key-value pair center=UChicago. Find the outputs of align_warp where the inputs are annotated with center=UChicago.

As noted in #5, there are no non-terminal nodes representing files in the execution trace, just non-terminal nodes representing inputs and outputs. There may be more than one input/output with the same pathname, so we can't really identify the images. However, we can annotate any input or output with a pathname we recognize as one of the anatomy images. To find the inputs and outputs associated with "anatomy1.img" or "anatomy3.img", we can do this query:

select $io
from <rmi://badger.ncsa.uiuc.edu/server1#d2kpc> where
$io <http://d2k/hasObjectValue> '/home/mcgrath/D2Kb/data/Provenance/anatomy1.img' or
$io <http://d2k/hasObjectValue> '/home/mcgrath/D2Kb/data/Provenance/anatomy3.img';

Given the output of this query, we can insert annotations. For example:

insert
<tag:provenance@ncsa.uiuc.edu,2006-08-29:c89b226a9e0d036969550a4b04e2ee21eef57bc1> <http://d2k/pc/hasAnnotation> $ann1
$ann1 <http://d2k/pc/hasAnnotationName> 'center'
$ann1 <http://d2k/pc/hasAnnotationValue> 'UChicago'
<tag:provenance@ncsa.uiuc.edu,2006-08-29:83c55840c30a113cae3ad80ddfe16615f9e31aa1> <http://d2k/pc/hasAnnotation> $ann2
$ann2 <http://d2k/pc/hasAnnotationName> 'center'
$ann2 <http://d2k/pc/hasAnnotationValue> 'UChicago'
into <rmi://badger.ncsa.uiuc.edu/server1#d2kpc>;

Now we can perform the query:

select $of $pathname
from <rmi://badger.ncsa.uiuc.edu/server1#d2kpc> where
$pi <http://d2k/moduleClass> 'ncsa.d2k.modules.core.util.ExecWarpAlign' and
$pi <http://d2k/hasInputPort> $ip and
$ip <http://d2k/hasInput> $if and
$pi <http://d2k/hasOutputPort> $op and
$op <http://d2k/producesOutput> $of and
$of <http://d2k/hasObjectValue> $pathname and
$if <http://d2k/pc/hasAnnotation> $ann and 
$ann <http://d2k/pc/hasAnnotationName> 'center' and
$ann <http://d2k/pc/hasAnnotationValue> 'UChicago';

#9. A user has annotated some atlas graphics with key-value pair where the key is studyModality. Find all the graphical atlas sets that have metadata annotation studyModality with values speech, visual or audio, and return all other annotations to these files.

From the workflow, we can infer that if a file is the output of "convert", it's an atlas graphic. That amounts to this query:

select $file $pathname
from <rmi://badger.ncsa.uiuc.edu/server1#d2kpc> where
$module <http://d2k/moduleClass> 'ncsa.d2k.modules.core.util.ExecConvert' and
$module <http://d2k/hasOutputPort> $op and
$op <http://d2k/producesOutput> $file and
$file <http://d2k/hasObjectValue> $pathname;

Now we need to add the annotations, using the same strategy as query #8:

For atlas x:

insert
<tag:provenance@ncsa.uiuc.edu,2006-08-29:2c81375a7979ff708bead27db5c58ee2ea6b4519> <http://d2k/pc/hasAnnotation> $ann1
$ann1 <http://d2k/pc/hasAnnotationName> 'studyModality'
$ann1 <http://d2k/pc/hasAnnotationValue> 'speech'
<tag:provenance@ncsa.uiuc.edu,2006-08-29:2c81375a7979ff708bead27db5c58ee2ea6b4519> <http://d2k/pc/hasAnnotation> $ann2
$ann2 <http://d2k/pc/hasAnnotationName> 'foo'
$ann2 <http://d2k/pc/hasAnnotationValue> 'bar'
<tag:provenance@ncsa.uiuc.edu,2006-08-29:2c81375a7979ff708bead27db5c58ee2ea6b4519> <http://d2k/pc/hasAnnotation> $ann3
$ann3 <http://d2k/pc/hasAnnotationName> 'foo'
$ann3 <http://d2k/pc/hasAnnotationValue> 'quux'
into <rmi://badger.ncsa.uiuc.edu/server1#d2kpc>;

For atlas y:

insert
<tag:provenance@ncsa.uiuc.edu,2006-08-29:b6caca535836f27ef2622a88bba9095eae36ed53> <http://d2k/pc/hasAnnotation> $ann1
$ann1 <http://d2k/pc/hasAnnotationName> 'studyModality'
$ann1 <http://d2k/pc/hasAnnotationValue> 'tactile'
<tag:provenance@ncsa.uiuc.edu,2006-08-29:b6caca535836f27ef2622a88bba9095eae36ed53> <http://d2k/pc/hasAnnotation> $ann2
$ann2 <http://d2k/pc/hasAnnotationName> 'foo'
$ann2 <http://d2k/pc/hasAnnotationValue> 'fnord'
into <rmi://badger.ncsa.uiuc.edu/server1#d2kpc>;

For atlas z:

insert
<tag:provenance@ncsa.uiuc.edu,2006-08-29:58214dfcc108fbdb1705d2c81de043ef599a5dff> <http://d2k/pc/hasAnnotation> $ann1
$ann1 <http://d2k/pc/hasAnnotationName> 'studyModality'
$ann1 <http://d2k/pc/hasAnnotationValue> 'visual'
into <rmi://badger.ncsa.uiuc.edu/server1#d2kpc>;

Now we can perform the query. The subquery produces a nested table which groups the annotation key/value pairs by which output they're associated with.

select $out
  subquery(select $name $value
           from <rmi://badger.ncsa.uiuc.edu/server1#d2kpc> where
           $out <http://d2k/pc/hasAnnotation> $otherAnn and
           $otherAnn <http://d2k/pc/hasAnnotationName> $name and
           $otherAnn <http://d2k/pc/hasAnnotationValue> $value)
from <rmi://badger.ncsa.uiuc.edu/server1#d2kpc> where
$module <http://d2k/moduleClass> 'ncsa.d2k.modules.core.util.ExecConvert' and
$module <http://d2k/hasOutputPort> $op and
$op <http://d2k/producesOutput> $out and
$out <http://d2k/pc/hasAnnotation> $ann and
$ann <http://d2k/pc/hasAnnotationName> 'studyModality' and
($ann <http://d2k/pc/hasAnnotationValue> 'speech' or
$ann <http://d2k/pc/hasAnnotationValue> 'audio' or
$ann <http://d2k/pc/hasAnnotationValue> 'visual');
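The shape of the nested result can be illustrated with a short Python sketch. It assumes the annotations have been flattened to (output, name, value) rows; the output labels are shortened stand-ins for the tag: URIs, and the group_annotations helper is hypothetical. The sample values are the ones inserted above.

```python
from collections import defaultdict


def group_annotations(rows, key="studyModality",
                      wanted=("speech", "visual", "audio")):
    """Group (name, value) pairs by output, keeping only outputs that
    carry `key` with one of the `wanted` values."""
    by_output = defaultdict(list)
    for out, name, value in rows:
        by_output[out].append((name, value))
    return {
        out: annotations
        for out, annotations in by_output.items()
        if any(n == key and v in wanted for n, v in annotations)
    }


rows = [
    ("atlas-x", "studyModality", "speech"),
    ("atlas-x", "foo", "bar"),
    ("atlas-x", "foo", "quux"),
    ("atlas-y", "studyModality", "tactile"),
    ("atlas-y", "foo", "fnord"),
    ("atlas-z", "studyModality", "visual"),
]
result = group_annotations(rows)
```

Atlas y is excluded because its studyModality value, tactile, is not one of the three requested values.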

Suggested Workflow Variants, Suggested Queries

This workflow could have been implemented in D2K in a number of ways. For instance, instead of using module classes to distinguish executable steps, we could have used a generalized external-executable module class for which the pathname of the executable is a property of the module.

The itinerary could also have been implemented as an arbitrarily-nested multi-level itinerary, and the levels of hierarchy could be significant in various ways. For instance they could be used to represent workflow "stages" as in the example workflow description.

Categorization of queries

Leaving aside a generalized taxonomy of provenance queries, there were two primary kinds of clauses required to answer the challenge queries.

Graph-walking with transitivity is required to construct declarative queries to answer some of the structural conditions in the challenge queries; iTQL is ideal for that case.

-- JoeFutrelle - 12 Sep 2006

Attachments

pc_d2k_headers.txt (18.6 K, 12 Sep 2006 - 00:50, JoeFutrelle): Triples inserted containing header data
queries_d2k.html (96.3 K, 12 Sep 2006 - 01:29, JoeFutrelle): Queries and result tables in HTML
d2k-provenance.txt (80.4 K, 12 Sep 2006 - 02:26, JoeFutrelle): D2K output for the example run
pc_d2k_screenshot.jpg (148.9 K, 12 Sep 2006 - 02:34, JoeFutrelle): Screenshot of D2K itinerary
pc_d2k_impl.tgz (8.0 K, 12 Sep 2006 - 16:24, JoeFutrelle): D2K implementation of example workflow including itinerary files and source code


Copyright © 1999-2012 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.