Skip to topic | Skip to bottom

Provenance Challenge

Challenge
Challenge.MyGrid

Start of topic | Skip to actions

Provenance Challenge Template

Participating Team

Workflow Representation

Our workflow is represented using the Scufl language. challengeWF.png Figure 1 – The Scufl Workflow for the challenge

Provenance Trace

Our provenance is recorded as RDF (Resource Description Framework). The schema is shared under http://cvs.mygrid.org.uk/cgi-bin/viewcvs.cgi/mygrid/miasgrid/rdf-provenance/etc/ontology/. The actual RDF provenance traces are shared under http://www.cs.man.ac.uk/~zhaoj/challenge. Details can be referred in [2].

Provenance Queries Matrix

Teams Queries
Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9
myGrid team thumbs up thumbs up thumbs up thumbs up thumbs up thumbs up thumbs up thumbs up thumbs up

Provenance Queries

1. Find the process that led to Atlas X Graphic / everything that caused Atlas X Graphic to be as it is. This should tell us the new brain images from which the averaged atlas was generated, the warping performed etc.


Input: the output data d named as "Atlas X Graphic" in a run r

Output: a set of process runs and data that led the d


Note, in answering this query, we constrain the scope of the query within a particular workflow run r in order to avoid presenting too many results, although this can be easily adapted as querying over the whole provenance repository, as shown in the first example.

From our understanding, we believe this query requires all the process runs and data that contribute to the creation of d during the run of r, directly or indirectly, should also be returned, rather than only those contribute directly. Thus, this query is realized in two steps:

  1. getting the identity of d with name as "Atlas X Graphic" in a run r,

ii. getting a set of process runs {pi} and {di}, each pi or di contributed to the creation of d.

Step 1 Get the identity of the output data name as “AtlasXGraphic” that is generated in the run r:

SELECT ?datalsid
WHERE <urn:lsid:www.mygrid.org.uk:experimentinstance:C3WK5J3HNW1> 
(?datalsid <http://www.mygrid.org.uk/provenance#outputDataHasName>
<urn:www.mygrid.org.uk/process#convert1_out_AtlasXGraphic>) 
USING ns FOR <http://www.mygrid.org.uk/provenance#>

In the following query we show how we can find all the d in the whole provenance repository rather than within one run r.

SELECT ?datalsid
WHERE 
(?datalsid <http://www.mygrid.org.uk/provenance#outputDataHasName>
<urn:www.mygrid.org.uk/process#convert1_out_AtlasXGraphic>) 
USING ns FOR <http://www.mygrid.org.uk/provenance#>

Result: this returns the d that named as “Atlas X Graphic” and created in the run “urn:lsid:www.mygrid.org.uk:experimentinstance:C3WK5J3HNW1”.


Get data urn:lsid:www.mygrid.org.uk:lsdocument:LZ25GL912Y49

Step 2

  1. Get the collection of data products Di = {d1, d2,…dn}, with d being derived from di and the process runs Pi = {p1, p2, …pn} that created each data product in Di.

b. Get the collection of data products Dj, with dj being derived from each di in Di and the process runs Pj that created each data product in Dj.

c. The query stops when all the data in Dj are input data to the workflow run.

The query for getting the data products that are directly derived from d is shown in the following:

SELECT ?value
WHERE ( <" + sourceIndividual
       + "> <" + myg:dataDerivedFrom + "> ?value ) "
USING ns FOR <http://www.mygrid.org.uk/provenance#>

Result: This query returns 27 data that d is derived from and 23 process runs that contribute to the creation of d, listed in Figure 2.

query2_result_table.png

Figure 2 – The result from provenance query 1

Figure 3 shows how this path can be visualized in the mygrid provenance browser from a process-oriented view. Figure 4 shows how the data path can be visualized as a DAG. This DAG shows how the Atlas X Graphic data, i.e. “LZ25GL912Y49”, is derived from

Here, the yellow colored data nodes are input to the workflow run and the blue colored data nodes are intermediate data products or workflow outputs.

query2_result_2.png

Figure 3 – The result in Figure 2 as visualized in the provenance browser

query2_result_3.PNG Figure 4 – The result in Figure 2 visualized as a DAG

2. Find the process that led to Atlas X Graphic, excluding everything prior to the averaging of images with softmean.

Input: the output d named as “AtlasXGraphic” and the scope of the query as “after softmean”

Output: a set of process runs {pi} and data {di} that led the d


Similar to the Query1, we first found the identity of d with the name as “AtlasXGraphic”. In the second step, the result provenance logs are constrained to exclude all the logs about runs prior to “softmean”. The algorithm is shown in the following:

a. Get the collection of data products Di = {d1, d2,…dn}, with d being derived from di and the process runs Pi = {p1, p2, …pn} that created di in Di.

b. if di in Dj is created by running the process “softmean”, we will not retrieve data that are derived from di.

c. If Di contains only data that are created by “softmean”, the query stops; otherwise go to step d.

d. Get the collection of data products Dj, with dj being derived from each di in Di and the process runs Pj that created each data product in Dj.

e. query ends.

Result: This query returns the 4 data that d is derived from and the 3 process runs that contribute to the creation of d prior to the “softmean”.


The returned process runs: urn:www.mygrid.org.uk/process_run#1155043506390

urn:www.mygrid.org.uk/process_run#1155043506000

urn:www.mygrid.org.uk/process_run#1155043504343


the returned data:

urn:lsid:www.mygrid.org.uk:lsdocument:LZ25GL912Y46

urn:lsid:www.mygrid.org.uk:lsdocument:LZ25GL912Y44

urn:lsid:www.mygrid.org.uk:lsdocument:LZ25GL912Y21

urn:lsid:www.mygrid.org.uk:lsdocument:LZ25GL912Y45


Figure 5 shows how the data path can be visualized as a DAG.

figure2_result.png

Figure 5. The data view of the result of query 2.

3. Find the Stage 3, 4 and 5 details of the process that led to Atlas X Graphic.

By default, in the mygrid provenance browser users can navigate the complete provenance log of each workflow run. By using query 2, we find the set of process runs that contributed to the creation of Atlas X Graphic, i.e. a subset of the complete provenance log.

Figure 6 shows how this subset of logs can be viewed in the mygrid provenance browser, including all the details of these three process runs:

figure3_result.png

Figure 6. The process view of the returned provenance logs from query 3.

Note, the value of the intermediate data products are retrieved from the data store and we used pseudo values in the workflow runs.

4. Find all invocations of procedure align_warp using a twelfth order nonlinear 1365 parameter model that ran on a Monday.

Input: the name of a process p as “align_warp”, the property of the parameter as “twelfth order”, the time of the run as “Monday”

Output: a set of process runs {pi} of the p with each pi satisfying the constraints by the two other inputs of the query, i.e. the parameter property and the date/time of the run.


The property of the parameter was treated as an annotation to the data. This annotation is attached to the workflow definition file using the Taverna knowledge provenance template. This query is realized in two steps. 1) Getting all the runs {ri} that were ran on Monday 2) In each ri, getting all the process runs that invoke the “align_warp” service with a parameter “twelfth order”

1) Firstly we get all the runs that were ran on a Monday by the following query: As our provenance logs include the timestamp information about each workflow or process run, we can answer this query by the following query.

SELECT ?p
WHERE (?p <http://www.mygrid.org.uk/provenance#startTime> ?time) AND (?time > date)
USING ns FOR <http://www.mygrid.org.uk/provenance#> 
   xsd FOR <http://www.w3.org/2001/XMLSchema#>

In our example repository, this query returned 2 workflow runs:


urn:lsid:www.mygrid.org.uk:experimentinstance:HXQOVQA2ZI0

urn:lsid:www.mygrid.org.uk:experimentinstance:JF9QILMXB60


2) In the second step we try to find in each of these workflow runs, the invocations of procedure align_warp using a twelfth order nonlinear 1365 parameter model.

SELECT ?p 
WHERE <urn:lsid:www.mygrid.org.uk:experimentinstance:HXQOVQA2ZI0>
(?p <http://www.mygrid.org.uk/provenance#runsProcess> ?processname . 
?p <http://www.mygrid.org.uk/provenance#processInput> ?inputParameter .
?inputParameter <ont:model> <ontology:twelfthOrder>) 
USING ns FOR <http://www.mygrid.org.uk/provenance#> 
      ont FOR <http://www.mygrid.org.uk/ontology#>

This returns two sets process runs:


Get runs urn:www.mygrid.org.uk/process_run#1155222243328

Get runs urn:www.mygrid.org.uk/process_run#1155222243578

Get runs urn:www.mygrid.org.uk/process_run#1155222242781

Get runs urn:www.mygrid.org.uk/process_run#1155222243062



Get runs urn:www.mygrid.org.uk/process_run#1155224291812

Get runs urn:www.mygrid.org.uk/process_run#1155224292453

Get runs urn:www.mygrid.org.uk/process_run#1155224292031

Get runs urn:www.mygrid.org.uk/process_run#1155224292250


5. Find all Atlas Graphic images outputted from workflows where at least one of the input Anatomy Headers had an entry global maximum=4095. The contents of a header file can be extracted as text using the scanheader AIR utility.


Input: a set of data {ind} with the named as “Anatomy Headers” which has an entry 4095

Output: a set of data {outd} named as “Atlas Graphic images” which are derived from the {ind}


In order to realize this query, we used the semantic metadata attached to the data products, as we did in the last query. Also to avoid presenting too many results, we choose to search for the target {outd} within a certain workflow run, although this query can be easily extended to query the overall provenance repository, as shown in the query 1.

We realize this query in two steps; 1)Getting all the identities for the {ind} 2)Getting all the {outd} which are indirectly derived from {ind} using the deep provenance algorithm.

1) In the first step, we try to find all the {ind}. The query is shown in the following:

SELECT ?data 
WHERE <urn:lsid:www.mygrid.org.uk:experimentinstance:HXQOVQA2ZI0> 
(?data <ont:global> ?value) AND (?value eq "ontology:4095") 
USING ns FOR <http://www.mygrid.org.uk/provenance#> 
ont FOR <http://www.mygrid.org.uk/ontology#>

This returns the data product:


Get data urn:lsid:www.mygrid.org.uk:lsdocument:CIEIR0DODN10

This {ind} was produced in the run “urn:lsid:www.mygrid.org.uk:experimentinstance:HXQOVQA2ZI0” with the annotation showing that it has a maximum value 4095.

2) In the second, we try to find all the {outd} that are derived from {ind}. As {outd} is not directly derived from {ind}, but from some intermediate data products, including

The derivation relationship between {outd} and {ind} is not recorded directly in our provenance repository. Thus we use an algorithm named as “deep provenance” in order o find all the data products that are indirectly derived from another data product. The algorithm is as the following:

Given a input data ind or a set of input data {ind} and the target point for the derivation path, e,g. the type of data or the process that produced the data, the algorithm looks for the data indirectly derived from {ind} as the following: a). get a set of data {di} that is directly derived from {ind}

b). get a set of data {dj}, as each dj is directly derived from each di

c). the algorithm stops if the target point is reached

d). return {dk} 1<= i <= j <=k as the result {outd}.

Thus, the target Atlas Graphic images that are indirectly derived from “urn:lsid:www.mygrid.org.uk:lsdocument:CIEIR0DODN10” is returned as:


urn:lsid:www.mygrid.org.uk:lsdocument:CIEIR0DODN33

urn:lsid:www.mygrid.org.uk:lsdocument:CIEIR0DODN32

urn:lsid:www.mygrid.org.uk:lsdocument:CIEIR0DODN31


This can be simply verified by the little data derivation path shown in Figure 7. This figure shows which three Atlas Graphic images (CIEIR0DODN33, CIEIR0DODN32, CIEIR0DODN31) are derived from the header data “CIEIR0DODN10” and how they are derived from “CIEIR0DODN10” indirectly during a workflow run.

figure5_result.png

Figure 7.The data derivation path for the query 5.

6. Find all output averaged images of softmean (average) procedures, where the warped images taken as input were align_warped using a twelfth order nonlinear 1365 parameter model, i.e. "where softmean was preceded in the workflow, directly or indirectly, by an align_warp procedure with argument -m 12."


Input: a set of data {ind} that are input of “align warp” with the parameter “-m 12”

Output: a set of data {outd} that are output of “softmean” that are derived from {ind}


This query is similar to query 5. The target point for the “deep provenance” algorithm is changed to “output of softmean”. Thus two steps are performed to answer this query.

In the first step, we look for the identities of {ind} produced in a workflow run:

SELECT ?data 
WHERE <urn:lsid:www.mygrid.org.uk:experimentinstance:HXQOVQA2ZI0> 
(?p <http://www.mygrid.org.uk/provenance#processInput> ?data .
?data <ont:model> ?value) AND (?value eq "ontology:twelfthOrder") 
USING ns FOR <http://www.mygrid.org.uk/provenance#> 
   ont FOR <http://www.mygrid.org.uk/ontology#>

And this returned one input to the runs of a “align_warp” service, which used the argument “twelfthOrder”:


Get data urn:lsid:www.mygrid.org.uk:lsdocument:CIEIR0DODN11

In the second step, we use the same deep provenance query to find the {outd} from “softmean” that is directly from {ind}. This returns two data products as the following:


urn:lsid:www.mygrid.org.uk:lsdocument:CIEIR0DODN26

urn:lsid:www.mygrid.org.uk:lsdocument:CIEIR0DODN27


This shows that this parameter contributed to the production of 2 outputs from the “softmean”. This can be verified by the data derivation path in Figure 8, where the two outputs from “softmean”, i.e. (CIEIR0DODN26 and CIEIR0DODN27), are derived from the parameter input to the “align_warp”, i.e. (CIEIR0DODN11).

figure6_result.png Figure 8. The data derivation path for the query 6.

7. A user has run the workflow twice, in the second instance replacing each procedures (convert) in the final stage with two procedures: pgmtoppm, then pnmtojpeg. Find the differences between the two workflow runs. The exact level of detail in the difference that is detected by a system is up to each participant.

In mygrid, this comparison can be done in three levels:

At the workflow design level: The difference between the two workflow definition files can be identified. This is done by comparing the services invoked during each workflow run, which are recorded in the provenance logs.

The result shows that:


difference urn:www.mygrid.org.uk/process#convert1

difference urn:www.mygrid.org.uk/process#convert3

difference urn:www.mygrid.org.uk/process#convert2

difference urn:www.mygrid.org.uk/process#pnmtojpeg1

difference urn:www.mygrid.org.uk/process#pnmtojpeg3

difference urn:www.mygrid.org.uk/process#pgmtoppm2

difference urn:www.mygrid.org.uk/process#pnmtojpeg2

difference urn:www.mygrid.org.uk/process#pgmtoppm3

difference urn:www.mygrid.org.uk/process#pgmtoppm1


At the process view of provenance: We can find out different levels of differences between these two runs, including:


Get process runs urn:www.mygrid.org.uk/process_run#1155132483171

Get process runs urn:www.mygrid.org.uk/process_run#1155132483312

Get process runs urn:www.mygrid.org.uk/process_run#1155132482984

Get process runs urn:www.mygrid.org.uk/process_run#1155132483250

Get process runs urn:www.mygrid.org.uk/process_run#1155132483093

Get process runs urn:www.mygrid.org.uk/process_run#1155132482921


At the data view of provenance:

We can find out the different data products that were produced in each run. As in mygrid, each data product is given a unique identity within each run. Two data products with the same data values but computed by the same process in different runs, i.e. equivalent data products, are given different identities. In the provenance logs, as only the data identities but not the data values are recorded, the data identities for the equivalent data products have to be processed before comparing the two logs.

This work has been published in our paper[2]. After the normalization process, we succeeded in finding the following different data products were produced by comparing one run with another:


urn:lsid:www.mygrid.org.uk:lsdocument:0XU3I4PS2Q34 urn:www.mygrid.org.uk/workflow#pnmtojpeg1_out_AtlasXGraphic

urn:lsid:www.mygrid.org.uk:lsdocument:0XU3I4PS2Q36 urn:www.mygrid.org.uk/workflow#pnmtojpeg3_out_AtlasZGraphic

urn:lsid:www.mygrid.org.uk:lsdocument:0XU3I4PS2Q32 urn:www.mygrid.org.uk/process#pgmtoppm2_out_PpmAtlasYGraphic

urn:lsid:www.mygrid.org.uk:lsdocument:0XU3I4PS2Q35 urn:www.mygrid.org.uk/workflow#pnmtojpeg2_out_AtlasYGraphic

urn:lsid:www.mygrid.org.uk:lsdocument:0XU3I4PS2Q33 urn:www.mygrid.org.uk/process#pgmtoppm3_out_PpmAtlasZGraphic

urn:lsid:www.mygrid.org.uk:lsdocument:0XU3I4PS2Q31 urn:www.mygrid.org.uk/process#pgmtoppm1_out_PpmAtlasXGraphic


8. A user has annotated some anatomy images with a key-value pair center=UChicago. Find the outputs of align_warp where the inputs are annotated with center=UChicago.

Input: the semantic annotation “center=UChicago”

Output: a set of data {outd} that are directly derived from data {ind} annotated with the semanticMarp.


This query is realized in two steps: 1). Getting the {ind} that are annotated with “center=UChicago” in a workflow run r; 2). Getting the {outd} that are derived from {ind} in this run r.

1) The query of the first step is shown as the following:

SELECT ?data 
WHERE <urn:lsid:www.mygrid.org.uk:experimentinstance:HXQOVQA2ZI0> 
(?data <ont:center> ?value) AND (?value eq "ontology:UChicago") 
USING ns FOR <http://www.mygrid.org.uk/provenance#> 
   ont FOR <http://www.mygrid.org.uk/ontology#>
This returns:
Get data urn:lsid:www.mygrid.org.uk:lsdocument:CIEIR0DODN5

This data was created in the workflow run “urn:lsid:www.mygrid.org.uk:experimentinstance:HXQOVQA2ZI0” and annotated with “center=UChicago”.

Without scoping in which workflow we query, the query will be like:

SELECT ?data 
WHERE (?data <ont:center> ?value) AND (?value eq "ontology:UChicago") 
USING ns FOR <http://www.mygrid.org.uk/provenance#> 
   ont FOR <http://www.mygrid.org.uk/ontology#>

And this returns more than one target data as each of it was produced in different workflow runs:


Get data urn:lsid:www.mygrid.org.uk:lsdocument:N0OUDYG6JS11

Get data urn:lsid:www.mygrid.org.uk:lsdocument:WPVNL86SO56

Get data urn:lsid:www.mygrid.org.uk:lsdocument:JDD53ECRD65

Get data urn:lsid:www.mygrid.org.uk:lsdocument:1JBOYN02LN8

Get data urn:lsid:www.mygrid.org.uk:lsdocument:0XU3I4PS2Q7

Get data urn:lsid:www.mygrid.org.uk:lsdocument:CIEIR0DODN5

Get data urn:lsid:www.mygrid.org.uk:lsdocument:0O00XFCTI15


2) In the second step, we try to find the {outd} that are derived from the {ind} in the run r.


Get data urn:lsid:www.mygrid.org.uk:lsdocument:CIEIR0DODN17

This returns one data that is output from “align_warp”, whose input was annotated with “center=UChicago”.

9. A user has annotated some atlas graphics with key-value pair where the key is studyModality. Find all the graphical atlas sets that have metadata annotation studyModality with values speech, visual or audio, and return all other annotations to these files.

Input: the semantic annotation “studyModality = speech”

Output: a set of data {outd} that are annotated with the semanticMarp and the set of semantic annotations about {outd}.


The first part of the query is the same as query 8:

SELECT ?data 
WHERE <urn:lsid:www.mygrid.org.uk:experimentinstance:HXQOVQA2ZI0> 
(?data <ont:studyModality> ?value) AND (?value eq "ontology:speech") 
 USING ns FOR <http://www.mygrid.org.uk/provenance#> 
   ont FOR <http://www.mygrid.org.uk/ontology#>

This returns:


Get data urn:lsid:www.mygrid.org.uk:lsdocument:CIEIR0DODN32

In the second part of the query, as all the semantic annotations are sub properties of the property “userDefinedPredicate” in our provenance ontology, in order to find other semantic annotations attached to a data product, we only need to retrieve all the user defined predicates and all the RDF typing information about this data product.

The result returns the following other semantic annotations about this data product “CIEIR0DODN32”, which tells that “CIEIR0DODN32” is a type of “Atlas Graphic”, an atomic data object and a data object.


urn:lsid:www.mygrid.org.uk:lsdocument:CIEIR0DODN32 rdf:type ontology:AtlasGraphic

urn:lsid:www.mygrid.org.uk:lsdocument:CIEIR0DODN32 rdf:type http://www.mygrid.org.uk/provenance#DataObject

urn:lsid:www.mygrid.org.uk:lsdocument:CIEIR0DODN32 rdf:type http://www.mygrid.org.uk/provenance#AtomicData


Suggested Wokflow Variants

As shown by the example workflow of this provenance challenge, the same service was invoked more than once in one workflow run, e.g. the align warp, reslice, slicer and convert service. In Taverna, in order to realize this abstract workflow, we had created three instances of each such service at each stage.

This works fine when only three different inputs were processed in a workflow, as in the example. But in reality there are many chances that many more, e.g. thousands of inputs were processed by the same service in one workflow run, sometimes with different parameters. Therefore, we propose extending the example workflow with iterations and nested workflows, as already supported in the Taverna system:

challenge_mygrid2.png

Figure 9. Adopting implicit iterations.

challenge_mygrid.png

Figure 10. Using a nested workflow.

challenge_mygrid3.png

=Figure 11. Using interaction service. =

Suggested Queries

Categorisation of queries

Generally speaking, our provenance queries can be categorized into four levels:
  1. queries to support the provenance browser
  2. semantic queries
  3. integration queries
  4. pre-canned queries to support provenance usage scenarios.

Live systems

Further Comments

Conclusions

-- JunZhao - 29 Aug 2006
to top


End of topic
Skip to action links | Back to top

You are here: Challenge > FirstProvenanceChallenge > ParticipatingTeams > MyGrid

to top

Copyright © 1999-2012 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.