Provenance Challenge Template
Participating Team
- Short team name: ZOOM, Database Group, University of Pennsylvania
- Participant names: Sarah Cohen-Boulakia, Shirley Cohen, Susan Davidson, Olivier Biton and Thunyarat (Bam) Amornpetchkul
- Project URL: http://db.cis.upenn.edu/research/provwf.html
- Project Overview: This project aims to provide a formal model of provenance for scientific workflows which is both simple and general (i.e. can be used with existing workflow systems, such as Ptolemy/Kepler and myGrid) and sufficiently expressive to answer the provenance queries encountered in case studies. Interestingly, the proposed model not only takes into account the chained and complex structures of scientific workflows, but also allows for reasoning about provenance at different levels of abstraction through user views.
- Provenance-specific Overview: See above.
- Relevant Publications: Shirley Cohen, Sarah Cohen-Boulakia and Susan Davidson. Towards a Model of Scientific workflows and User Views. Proceedings of DILS'06, Data Integration for the Life Sciences, Springer-Verlag, Lecture Notes in Bioinformatics (LNBI) Num. 4075, pp. 264-279.
Workflow Representation
Our aim is to provide a framework to represent minimal information necessary such that we can reason about provenance in scientific workflows. With this in mind, we distinguish between the specification of the workflow and the execution of the workflow. On the one hand, the workflow specification is composed of
step-classes (alternatively called procedures, tasks, actors, processes, boxes) such as
reslice in the challenge workflow. On the other hand, an execution of a workflow generates a partial order of
steps, each of which has a set of
input and
output data objects. Each step is an instance of a step-class, and the input-output flow of data and class associated with each step must conform to the workflow specification. An example of step in the challenge workflow is “8.reslice” which is an instance of the step-class “reslice”. The proposed model of provenance for workflows can represent the relationship between a step-class (specification) and a step (execution). This is necessary for keeping track of the data and parameters used by each step of a workflow execution as well as to trace the data produced. The model is represented within a relational framework extended with transitive closure. It is implemented under Oracle 10.g and Java. An object layer together with a user interface (JDBC component) is provided. More details on our model and implementation of the Challenge workflow are provided in the following section.
Provenance Trace
Our model is composed of base tables which provide minimal information about the workflow execution, the workflow specification, and the relationships between the specification and execution. The model is represented though the E/R diagram below. The tables IMMCONTAINS, IMMSTEPCONTAINS and USERVIEW will be defined in the section
Worflow Variants. The other tables and their schemas are listed below. We list the type of each column (VARCHAR are strings) and the primary keys (in bold). For each table, we give an example based on the Challenge workflow.
Figure 1 – E/R schema for the model of Provenance
Note: The content of the tables is provided at the end of this page (see images tablesData-1.jpg to tablesData-4.jpg).
Data (
ID int, name string, type string) relates the unique identifier of a data item to its name and its type.
As an example, the data whose ID is 1 has the name “Anatomy Image1” and is of type Anatomy Image.
DATAID NAME TYPE
-----------------------------------------------------
1 Anatomy Image1 Anatomy Image
dataAttributes (dataId string, attribute string, value string) contains the annotations of each data item. There can be several pairs of keys and values annotated with each data item. As an example, Anatomy Image1, whose ID is 1, is annotated with a key word “center” and value “UChicago.” In other words, center= UChicago associated with DATAID 1, is represented in our model in the following way:
DATAID ATTRIBUTE VALUE
-----------------------------------------------------
1 center UChicago
stepParam(step int, attribute string, value string) stores information about the parameters used by a step.
step is the id of the step,
attribute represents the category of the step, and
value represents the description of the category.
As an example, the table below expresses that the step whose id is 1 takes a nonlinear (linear=false), 12th order (order=12) parameter, and whose model is 1365 (model=1365). This parameter can be invoked by using “-m 12 –q” (description=-m 12 –q).
STEP ATTRIBUTE VALUE
-----------------------------------------------------
1 linear false
1 order 12
1 model 1365
1 description -m 12 -q
input(
step int,
dataId int, ts Date) stores information about an input to a step and the timestamp it occurred.
As an example, Step 2 (2.align_warp) takes as inputs data whose ID is 3 (Anatomy Image2), 4 (Anatomy Header2), 9 (Reference Image), and 10 (Reference Header). Each input in the example was taken on the same date, 8/8/2006.
STEP DATAID TS
-----------------------------------------------------
2 3 8/8/2006
2 4 8/8/2006
2 9 8/8/2006
2 10 8/8/2006
output(
step int,
dataId int, ts Date) stores information about an output from a step and the timestamp it occurred.
As an example, Step 10 (10. slicer) produced one output with a DATAID of 25 (Atlas X Slice) on 08/17/2006.
STEP DATAID TS
-----------------------------------------------------
10 25 8/17/2006
instanceOf (
step int, step-class string, ts) defines the relationship between a step-class and a step of which it is an execution, and contains the execution time of the step.
As an example, steps 1, 2, 3, and 4 are executions of the step-class “align_warp.”
STEP STEPCLASS TS
-----------------------------------------------------
1 align_warp 8/7/2006
2 align_warp 8/8/2006
3 align_warp 8/10/2006
4 align_warp 8/11/2006
Note: The notion of stage was not initially in our model and we think it needs to be better defined. We will go back to this point. However, we added it to be able to answer some of the queries in the challenge.
stageInstance(
step string, stage string) stores the relationship between a stage and a step. As an example, steps 1 through 4 are in stage 1.
STEP STAGE
-----------------------------------------------------
1 1
2 1
3 1
4 1
This table is automatically generated from the previous tables. It represents the details of each execution including the step, step-class, input, output, parameter, and starting time.
CREATE VIEW Process
AS
SELECT DISTINCT
instanceOf.step step,
instanceOf.stepClass stepClass,
input.dataId input,
dataInput.name inputName,
output.dataId output,
dataOutput.name outputName,
input.ts time
FROM
instanceOf,
input,
data dataInput,
output,
data dataOutput
WHERE instanceOf.step = input.step
AND input.dataId = dataInput.dataId
AND input.step = output.step
AND output.dataId = dataOutput.dataId
AND input.ts <= output.ts
As an example, step 13 is an execution of the step-class
convert. It takes an input of DATAID 25 (Atlas X Slice) and was executed on 08/20, 2006. It does not take any parameters. It produces the output of DATAID 28 (Atlas X Graphic).
STEP STEPCLASS INPUT INPUTNAME OUTPUT OUTPUTNAME TIME
----------------------------------------------------------------
13 convert 25 Atlas X Slice 28 Atlas X Graphic 8/20/2006
Provenance Queries
Teams | Queries |
Q1 | Q2 | Q3 | Q4 | Q5 | Q6 | Q7 | Q8 | Q9 |
UPenn team | | | | | | | | | |
For each query, we provide the corresponding SQL query and the results obtained. We also provide these results in a user-friendly format (as screenshots of our Java user interface).
- Query1
- Query: Find the process that led to Atlas X Graphic / everything that caused Atlas X Graphic to be as it is. This should tell us the new brain images from which the averaged atlas was generated, the warping performed etc.
SELECT DISTINCT step, stepClass, input, inputName, output, outputName
FROM process
START WITH output =
(
SELECT dataId FROM data WHERE name = 'Atlas X Graphic'
)
CONNECT BY PRIOR input = output
ORDER BY step
-
- Note: This query exploits the Connect by prior operator of Oracle which implements the transitive closure function. Transitive closure is necessary for returning all the data (recursively) that have been used to compute Atlas X Graphic. Basically, we use the Process table and search reccursively for the data used (output produced and input used, by mean of "connect by prior input=output").
- Result (SQL):
STEP STEPCLASS INPUT INPUTNAME OUTPUT OUTPUTNAME
----------------------------------------------------------------
1 align_warp 1 Anatomy Image1 11 Warp Parameters1
1 align_warp 10 Reference Header 11 Warp Parameters1
1 align_warp 2 Anatomy Header1 11 Warp Parameters1
1 align_warp 9 Reference Image 11 Warp Parameters1
10 slicer 23 Atlas Image 25 Atlas X Slice
10 slicer 24 Atlas Header 25 Atlas X Slice
13 convert 25 Atlas X Slice 28 Atlas X Graphic
2 align_warp 10 Reference Header 12 Warp Parameters2
2 align_warp 3 Anatomy Image2 12 Warp Parameters2
2 align_warp 4 Anatomy Header2 12 Warp Parameters2
2 align_warp 9 Reference Image 12 Warp Parameters2
3 align_warp 10 Reference Header 13 Warp Parameters3
3 align_warp 5 Anatomy Image3 13 Warp Parameters3
3 align_warp 6 Anatomy Header3 13 Warp Parameters3
3 align_warp 9 Reference Image 13 Warp Parameters3
4 align_warp 10 Reference Header 14 Warp Parameters4
4 align_warp 7 Anatomy Image4 14 Warp Parameters4
4 align_warp 8 Anatomy Header4 14 Warp Parameters4
4 align_warp 9 Reference Image 14 Warp Parameters4
5 reslice 11 Warp Parameters1 15 Resliced Image1
5 reslice 11 Warp Parameters1 16 Resliced Header1
6 reslice 12 Warp Parameters2 17 Resliced Image2
6 reslice 12 Warp Parameters2 18 Resliced Header2
7 reslice 13 Warp Parameters3 19 Resliced Image3
7 reslice 13 Warp Parameters3 20 Resliced Header3
8 reslice 14 Warp Parameters4 21 Resliced Image4
8 reslice 14 Warp Parameters4 22 Resliced Header4
9 softmean 15 Resliced Image1 23 Atlas Image
9 softmean 15 Resliced Image1 24 Atlas Header
9 softmean 16 Resliced Header1 23 Atlas Image
9 softmean 16 Resliced Header1 24 Atlas Header
9 softmean 17 Resliced Image2 23 Atlas Image
9 softmean 17 Resliced Image2 24 Atlas Header
9 softmean 18 Resliced Header2 23 Atlas Image
9 softmean 18 Resliced Header2 24 Atlas Header
9 softmean 19 Resliced Image3 23 Atlas Image
9 softmean 19 Resliced Image3 24 Atlas Header
9 softmean 20 Resliced Header3 23 Atlas Image
9 softmean 20 Resliced Header3 24 Atlas Header
9 softmean 21 Resliced Image4 23 Atlas Image
9 softmean 21 Resliced Image4 24 Atlas Header
9 softmean 22 Resliced Header4 23 Atlas Image
9 softmean 22 Resliced Header4 24 Atlas Header
The results can also be visualized within the user-friendly interface. On the top, the query is given by the user, and on the bottom the result (provenance) is represented through a graph. The user can access to the characteristics of a given step or data item by clicking on it (visualized on the right hand side).
-
- Result (user friendly view):
*
Query2
-
- Query: Find the process that led to Atlas X Graphic, excluding everything prior to the averaging of images with softmean.
SELECT DISTINCT step, stepClass, input, inputName, output, outputName
FROM Process
START WITH output = (
SELECT dataId FROM data WHERE name = 'Atlas X Graphic'
)
CONNECT BY PRIOR input = output
MINUS (
SELECT DISTINCT step, stepClass, input, inputName, output, outputName
FROM Process
START WITH output IN (
SELECT input FROM Process WHERE stepClass = 'softmean'
)
CONNECT BY PRIOR input = output
)
-
- Note: We follow the same process as previously but we remove (Minus operator) all the data used by "softmean".
- Result (SQL):
STEP STEPCLASS INPUT INPUTNAME OUTPUT OUTPUTNAME
----------------------------------------------------------------
10 slicer 23 Atlas Image 25 Atlas X Slice
10 slicer 24 Atlas Header 25 Atlas X Slice
13 convert 25 Atlas X Slice 28 Atlas X Graphic
9 softmean 15 Resliced Image1 23 Atlas Image
9 softmean 15 Resliced Image1 24 Atlas Header
9 softmean 16 Resliced Header1 23 Atlas Image
9 softmean 16 Resliced Header1 24 Atlas Header
9 softmean 17 Resliced Image2 23 Atlas Image
9 softmean 17 Resliced Image2 24 Atlas Header
9 softmean 18 Resliced Header2 23 Atlas Image
9 softmean 18 Resliced Header2 24 Atlas Header
9 softmean 19 Resliced Image3 23 Atlas Image
9 softmean 19 Resliced Image3 24 Atlas Header
9 softmean 20 Resliced Header3 23 Atlas Image
9 softmean 20 Resliced Header3 24 Atlas Header
9 softmean 21 Resliced Image4 23 Atlas Image
9 softmean 21 Resliced Image4 24 Atlas Header
9 softmean 22 Resliced Header4 23 Atlas Image
9 softmean 22 Resliced Header4 24 Atlas Header
-
- Result (User friendly view):
- Query3
- Query: Find the Stage 3, 4 and 5 details of the process that led to Atlas X Graphic.
SELECT DISTINCT step, stepClass, input, inputName, output, outputName
FROM process
WHERE step IN (
SELECT step FROM StageInstance
WHERE (
stage in (3,4,5)
)
)
START WITH output = (
SELECT dataId FROM data WHERE name = 'Atlas X Graphic'
)
CONNECT BY PRIOR input = output
-
- Note: We follow the same idea than in query1 while focusing on stages 3, 4, and 5.
- Result (SQL):
STEP STEPCLASS INPUT INPUTNAME OUTPUT OUTPUTNAME
----------------------------------------------------------------
9 softmean 19 Resliced Image3 23 Atlas Image
9 softmean 21 Resliced Image4 23 Atlas Image
9 softmean 17 Resliced Image2 24 Atlas Header
9 softmean 20 Resliced Header3 24 Atlas Header
9 softmean 15 Resliced Image1 24 Atlas Header
9 softmean 19 Resliced Image3 24 Atlas Header
13 convert 25 Atlas X Slice 28 Atlas X Graphic
9 softmean 15 Resliced Image1 23 Atlas Image
9 softmean 18 Resliced Header2 23 Atlas Image
9 softmean 16 Resliced Header1 23 Atlas Image
9 softmean 18 Resliced Header2 24 Atlas Header
9 softmean 17 Resliced Image2 23 Atlas Image
10 slicer 23 Atlas Image 25 Atlas X Slice
9 softmean 22 Resliced Header4 23 Atlas Image
9 softmean 16 Resliced Header1 24 Atlas Header
9 softmean 22 Resliced Header4 24 Atlas Header
9 softmean 20 Resliced Header3 23 Atlas Image
10 slicer 24 Atlas Header 25 Atlas X Slice
9 softmean 21 Resliced Image4 24 Atlas Header
-
- Result (User friendly view):
-
- Note: The same results as query2 have been found. Query 2 and 3 are synonyms!
- Query4
- Query: Find all invocations of procedure align_warp using a twelfth order nonlinear 1365 parameter model (see model menu describing possible values of parameter "-m 12" of align_warp) that ran on a Monday.
SELECT step, stepClass, input, inputName, output, outputName, time
FROM process
WHERE stepClass = 'align_warp'
AND EXISTS
(
SELECT value
FROM stepParams
WHERE stepParams.step = process.step
AND attribute='order'
AND value='12'
)
AND EXISTS
(
SELECT value
FROM stepParams
WHERE stepParams.step = process.step
AND attribute='model'
AND value='1365'
)
AND rtrim(to_char(time, 'DAY')) = 'MONDAY'
-
- Note: We search for information about the step-class "align-warp" (first "Where" clause) such that there is a step whose parameter is "order=12" and "model=1365".
- Result (SQL):
STEP STEPCLASS INPUT INPUTNAME OUTPUT OUTPUTNAME TIME
----------------------------------------------------------------
1 align_warp 1 Anatomy Image1 11 Warp Parameters1 8/7/2006
1 align_warp 2 Anatomy Header1 11 Warp Parameters1 8/7/2006
1 align_warp 10 Reference Header 11 Warp Parameters1 8/7/2006
1 align_warp 9 Reference Image 11 Warp Parameters1 8/7/2006
-
- Result (User friendly view):
- Query5
- Query: Find all Atlas Graphic images outputted from workflows where at least one of the input Anatomy Headers had an entry global maximum=4095.
SELECT name
FROM data
WHERE dataId IN
(
SELECT output
FROM process
START WITH input IN (
SELECT data.dataId
FROM data, dataAttributes
WHERE data.dataId = dataAttributes.dataId
AND data.type = 'Anatomy Header'
AND dataAttributes.attribute = 'global maximum'
AND dataAttributes.value = '4095'
)
CONNECT BY PRIOR output=input
)
AND type = 'Atlas Graphic'
- Note: The dataAttributes table is used to find all Anatomy header data whose entry 'global maxiamum' is equal to 4095. Then, the same precess as previously is used (i.e. looking recursively for data).
- Result (SQL):
NAME
---------------------
Atlas X Graphic
Atlas Y Graphic
Atlas Z Graphic
- Query6
- Query: _Find all output averaged images of softmean (average) procedures, where the warped images taken as input were align-warped using a twelfth order nonlinear 1365 parameter model, i.e. "where softmean was preceded in the workflow, directly or indirectly, by an align-warp procedure with argument -m 12." _
SELECT distinct pred.orig, data.name
FROM
(
SELECT distinct CONNECT_BY_ROOT step orig, step
FROM process
START WITH stepClass='softmean'
CONNECT BY PRIOR input=output
) pred,
output,
data
WHERE
EXISTS (
SELECT *
FROM stepParams
WHERE attribute = 'order'
AND value = 12
AND stepParams.step = pred.step
)
AND EXISTS
(
SELECT *
FROM stepParams
WHERE attribute = 'model'
AND value = 1365
AND stepParams.step = pred.step
)
AND pred.orig = output.step
AND output.dataId = data.dataId
Note that we understand "-m12" as a search for parameter whose order is equal to 12. We could also have chosen to search for a parameter whose description was like "-m 12".
ORIG NAME
---------------------------------
9 Atlas Header
9 Atlas Image
- Query7
- Query: A user has run the workflow twice, in the second instance replacing each procedures (convert) in the final stage with two procedures: pgmtoppm, then pnmtojpeg. Find the differences between the two workflow runs. The exact level of detail in the difference that is detected by a system is up to each participant.
- Note: Computing the difference between two workflows or two workflow executions will be possible in our next implementation. Query 7 is a very simple example of the problem. However, when the difference between the data used by a step or produced by a step is considered, the problem becomes much harder. We would be very interested in talking about this problem with the workshop attendees.
- Query8
- Query: A user has annotated some anatomy images with a key-value pair center=UChicago. Find the outputs of align-warp where the inputs are annotated with center=UChicago.
SELECT distinct data.name
FROM process, dataAttributes, data
WHERE process.stepClass='align_warp'
AND process.input = dataAttributes.dataId
AND dataAttributes.attribute = 'center'
AND dataAttributes.value = 'UChicago'
AND process.output = data.dataId
-
- Note: This query basically consists in a join between dataAttributes (which contains information about annotations), Process, and Data.
- Result:
NAME
------------------
Warp Parameters2
Warp Parameters1
- Query9
- Query: A user has annotated some atlas graphics with key-value pair where the key is studyModality. Find all the graphical atlas sets that have metadata annotation studyModality with values speech, visual or audio, and return all other annotations to these files.
SELECT distinct data.dataId, data.name, attribute, value
FROM data, dataAttributes, attribute, value
WHERE
data.type='Atlas Graphic'
AND data.dataId = dataAttributes.dataId
AND dataAttributes.attribute = 'studyModality'
AND dataAttributes.value IN ('speech', 'visual', 'audio')
-
- Note: Once again, this query consists in finding annotations in the dataAttributes table and to make joins to get other information.
- Result:
DATAID NAME ATTRIBUTE VALUE
---------------------------------------------------------------------
29 Atlas Y Graphic studyModality audio
29 Atlas Y Graphic studyModality visual
30 Atlas Z Graphic studyModality speech
Suggested Workflow Variants
The model we propose takes into account the capabilities of many of the current scientific workflow systems; one such capability is the idea of composite step-classes, that is, step-classes which are composed of step-classes. There are several reasons why composite step classes are useful in workflows. First, they help to manage the complexity of a workflow; users may wish to focus on a certain level of abstraction and ignore the lower levels of detail. Second, composite step-classes can also represent levels of ``authorization''; users without the appropriate clearance level would not be allowed to see the lower level executions of a step-class.
Typically, an input to a composite step-class is also an input to one or more of its substep classes while an output of a substep class is either an input to another substep class or becomes the output of a composite step class. Also, composite step-classes define a partition of the step-classes of the workflow (there is no step-class which belongs to multiple composite step-classes).
We have therefore defined the following notion of user views:
Given a workflow specification, the user view of a user (or class of users) U, userView(U), is the set of lowest-level step classes that U is entitled to see.
Note that a user view cannot contain two step classes such that one is contained in the other. We also assume that the user view is valid, i.e. that each of the highest-level step classes in the workflow specification is either in the view, or that at some lower-level all of its contained substeps are in the user view.
To answer questions of provenance, we must take the user view into account and reason about the input and output to steps which are instances of step-classes that are in the user view. We must know the containment relationship between step-classes and the relationship between each user and the step-classes he is entitled to see.
In the following, we introduce the part of our model which deals with user views and we will give examples based on a variation of the provenance workflow in the challenge. In the following discussion, we refer to composite steps as “boxes”.
Figure 2 – Workflow for the challenge with user views (boxes 1 to 3)
UserView Application
In this section, we introduce some additional tables which are needed for reasoning about provenance in the context of user views:
immContains(compoStepClass string,
stepClass string) indicates the immediate containment between step-classes.
As an example, box 1 covers all the step-class align_warp and reslice, and box3 contains box1.
COMPOSTEPCLASS STEPCLASS
---------------------------------
box1 align_warp
box1 reslice
box3 box1
For each step class, contains gives all the step classes it transitively contains.
CREATE VIEW
AS
SELECT DISTINCT CONNECT_BY_ROOT compoStepClass as compoStepClass, stepClass
FROM immcontains
CONNECT BY PRIOR stepClass=compoStepClass
ORDER BY compoStepClass
COMPOSTEPCLASS STEPCLASS
---------------------------------
box1 align_warp
box1 reslice
box3 box1
box3 align_warp
box3 reslice
The table instanceOf contains information for composite steps too.
For example stepBox1-1 is the first instance of the box box1; it will contain the steps 1 and 5 (cf. table immStepContains and view containsStep).
This table defines direct steps containment, accordingly with immContains but for instances.
For each step class, contains gives all the step classes it transitively contains.
CREATE VIEW
AS
SELECT DISTINCT CONNECT_BY_ROOT compoStep compoStep, step
FROM immStepContains
CONNECT BY PRIOR step=compoStep
ORDER BY compoStep
COMPOSTEP STEP
---------------------------------
stepBox1-1 1
stepBox1-1 5
stepBox1-2 2
stepBox1-2 6
stepBox1-3 3
stepBox1-3 7
stepBox1-4 4
stepBox1-4 8
stepBox2-1 10
stepBox2-1 13
stepBox2-2 11
stepBox2-2 14
stepBox2-3 12
stepBox2-3 15
stepBox3-1 1
stepBox3-1 10
stepBox3-1 13
stepBox3-1 5
stepBox3-1 9
stepBox3-1 stepBox1-1
stepBox3-1 stepBox2-1
userView(usr string, stepClass string) indicates the lowest step-class level that a user can see. Remember that the union of all the step-classes a user can see must cover the whole workflow (user views are valid).
As an example, the user ‘uBlackBox’ sees the whole workflow as one step. The user ‘uBio’ cannot see the detail of step-class align_warp and reslice separately (he sees box1 instead). He cannot see the details of step-class slicer or convert (he sees box2 instead). However, he can see the step-class softmean. UAdmin can see all step-classes.
USR STEPCLASS
--------------------------------
uBlackBox box3
uBio box1
uBio softmean
uBio box2
uAdmin align_warp
uAdmin reslice
uAdmin softmean
uAdmin slicer
uAdmin convert
The following two tables, Cinput and Coutput are input and output extended to composite steps.
Similarly to input, Cinput stores information about inputs to a composite step and the time of the input to the box within the context of a user view. Specifically, an input to a box is an input to one of the steps the box contains, but must not be an output of any step contained in the box.
CREATE VIEW cInput
AS
-- First union operand: "For each containment, the input is an input of
-- the higher level step class if and only if this input was not an output
-- from another of its sub steps
SELECT containsStep.compoStep step, input.dataId, min(input.ts) ts
FROM input,
containsStep
WHERE
input.step = containsStep.step
AND NOT EXISTS
(
SELECT o.step
FROM output o, containsStep c2
WHERE o.dataId = input.dataId
AND o.step = c2.step
AND c2.compoStep = containsStep.compoStep
)
GROUP BY containsStep.compoStep, input.dataId
-- Second union operand
-- We want to keep the direct input relations
UNION
SELECT step, dataId, ts
FROM input;
As an example, the step stepBox2-1 takes DATAID 23 (Atlas Image) and DATAID 24 (Atlas Header) as inputs.
STEP DATAID TS
----------------------------
stepBox2-1 23 8/17/2006
stepBox2-1 24 8/17/2006
Similarly to output, Coutput stores information of outputs produced from each composite step and the time of the output to the box within the context of a user view. An output from a box is an output from one of the steps the box contains, but must not be an input to any step contained in the box.
CREATE VIEW coutput
AS
-- First union operand: "For each containment, the output is an output of
-- the higher level step class if and only if this output is used as an
-- input for another step which is not contained in the same step
-- or if this is a final output of the workflow
SELECT containsStep.compoStep step, output.dataId, output.ts
FROM output,
containsStep
WHERE
output.step = containsStep.step
AND EXISTS
(
/* Step class is used as an entry of at least one step
* which is not a substep of this step class
*/
SELECT i.step
FROM input i
WHERE i.dataId = output.dataId
AND NOT EXISTS
(
SELECT cont2.compoStep
FROM containsStep cont2
WHERE cont2.step = i.step
AND cont2.compoStep = containsStep.compoStep
)
UNION
-- Step class is a final output?
(
SELECT output.dataId
FROM dual
MINUS
SELECT i.dataId
FROM input i
)
)
UNION
-- Second union operand
-- We want to keep the direct output relations
SELECT step, dataId, ts
FROM output
As an example, step stepBox2-1 produced data ID 28 (Atlas Graphic X) on 08/20/2006.
STEP DATAID TS
----------------------------
stepBox2-1 28 8/20/2006
Similarly to Process, UProcess represents the detail of each execution which a user can see.
CREATE VIEW UProcess
AS
SELECT DISTINCT
userView.usr usr,
instanceOf.step step,
instanceOf.stepClass stepClass,
input.dataId input,
dataInput.name inputName,
output.dataId output,
dataOutput.name outputName,
input.ts time
FROM
userView,
instanceOf,
cinput input,
data dataInput,
coutput output,
data dataOutput
WHERE
userView.stepClass = instanceOf.stepClass
AND instanceOf.step = input.step
AND input.dataId = dataInput.dataId
AND input.step = output.step
AND output.dataId = dataOutput.dataId
AND input.ts <= output.ts
As an example, let us consider three user views and their perspectives on stage 1 of the workflow (align_warp).
The user ‘uAdmin’ can see all the steps of the workflow; and thus can see all data.
The user ‘uBio’ sees align_warp and reslice calls as single boxes (step class box1), thus he cannot see data whose ids are 11 to 14 (Warp Params 1 to 4).
He also sees slicer and convert as single boxes (step class box2), thus he cannot see data with ids 25 to 27 (Atlas X Slice, Atlas Y Slice and Atlas Z Slice).
The user ‘uBlackBox’ cannot see any detail of the workflow. He sees everything as a big box which entries are data id 1 to 10 and outputs are data id 28 to 30.
The implications of user views when it comes to query answering, are discussed in the following section.
Suggested Queries
In this section, we answer query 1 considering user views and we introduce another query to emphasize the benefit of our approach.
- Query1 for each user
- Query for user “uBlackbox”
SELECT DISTINCT step, stepClass, input, inputName, output, outputName
FROM uProcess
WHERE usr='uBlackBox'
START WITH output = (
SELECT dataId FROM data WHERE name = 'Atlas X Graphic'
)
CONNECT BY PRIOR input = output
ORDER BY step;
STEP STEPCLASS INPUT INPUTNAME OUTPUT OUTPUTNAME
stepBox3-1 box3 1 Anatomy Image1 23 Atlas Image
stepBox3-1 box3 1 Anatomy Image1 24 Atlas Header
stepBox3-1 box3 1 Anatomy Image1 28 Atlas X Graphic
stepBox3-1 box3 10 Reference Header 23 Atlas Image
stepBox3-1 box3 10 Reference Header 24 Atlas Header
stepBox3-1 box3 10 Reference Header 28 Atlas X Graphic
stepBox3-1 box3 17 Resliced Image2 23 Atlas Image
stepBox3-1 box3 17 Resliced Image2 24 Atlas Header
stepBox3-1 box3 17 Resliced Image2 28 Atlas X Graphic
stepBox3-1 box3 18 Resliced Header2 23 Atlas Image
stepBox3-1 box3 18 Resliced Header2 24 Atlas Header
stepBox3-1 box3 18 Resliced Header2 28 Atlas X Graphic
stepBox3-1 box3 19 Resliced Image3 23 Atlas Image
stepBox3-1 box3 19 Resliced Image3 24 Atlas Header
stepBox3-1 box3 19 Resliced Image3 28 Atlas X Graphic
stepBox3-1 box3 2 Anatomy Header1 23 Atlas Image
stepBox3-1 box3 2 Anatomy Header1 24 Atlas Header
stepBox3-1 box3 2 Anatomy Header1 28 Atlas X Graphic
stepBox3-1 box3 20 Resliced Header3 23 Atlas Image
stepBox3-1 box3 20 Resliced Header3 24 Atlas Header
stepBox3-1 box3 20 Resliced Header3 28 Atlas X Graphic
stepBox3-1 box3 21 Resliced Image4 23 Atlas Image
stepBox3-1 box3 21 Resliced Image4 24 Atlas Header
stepBox3-1 box3 21 Resliced Image4 28 Atlas X Graphic
stepBox3-1 box3 22 Resliced Header4 23 Atlas Image
stepBox3-1 box3 22 Resliced Header4 24 Atlas Header
stepBox3-1 box3 22 Resliced Header4 28 Atlas X Graphic
stepBox3-1 box3 9 Reference Image 23 Atlas Image
stepBox3-1 box3 9 Reference Image 24 Atlas Header
stepBox3-1 box3 9 Reference Image 28 Atlas X Graphic
- Results obtained in a user friendly format:
- Note: Only the first (DATAIDs 1 to 10) inputs of the workflow are seen by uBlackBox.
- Query for user “uBio”
- Query: As previously with WHERE users = ‘uBio’
- Result:
STEP STEPCLASS INPUT INPUTNAME OUTPUT OUTPUTNAME
9 softmean 15 Resliced Image1 23 Atlas Image
9 softmean 15 Resliced Image1 24 Atlas Header
9 softmean 16 Resliced Header1 23 Atlas Image
9 softmean 16 Resliced Header1 24 Atlas Header
9 softmean 17 Resliced Image2 23 Atlas Image
9 softmean 17 Resliced Image2 24 Atlas Header
9 softmean 18 Resliced Header2 23 Atlas Image
9 softmean 18 Resliced Header2 24 Atlas Header
9 softmean 19 Resliced Image3 23 Atlas Image
9 softmean 19 Resliced Image3 24 Atlas Header
9 softmean 20 Resliced Header3 23 Atlas Image
9 softmean 20 Resliced Header3 24 Atlas Header
9 softmean 21 Resliced Image4 23 Atlas Image
9 softmean 21 Resliced Image4 24 Atlas Header
9 softmean 22 Resliced Header4 23 Atlas Image
9 softmean 22 Resliced Header4 24 Atlas Header
stepBox1-1 box1 1 Anatomy Image1 15 Resliced Image1
stepBox1-1 box1 1 Anatomy Image1 16 Resliced Header1
stepBox1-1 box1 10 Reference Header 15 Resliced Image1
stepBox1-1 box1 10 Reference Header 16 Resliced Header1
stepBox1-1 box1 2 Anatomy Header1 15 Resliced Image1
stepBox1-1 box1 2 Anatomy Header1 16 Resliced Header1
stepBox1-1 box1 9 Reference Image 15 Resliced Image1
stepBox1-1 box1 9 Reference Image 16 Resliced Header1
stepBox1-2 box1 10 Reference Header 17 Resliced Image2
stepBox1-2 box1 10 Reference Header 18 Resliced Header2
stepBox1-2 box1 3 Anatomy Image2 17 Resliced Image2
stepBox1-2 box1 3 Anatomy Image2 18 Resliced Header2
stepBox1-2 box1 4 Anatomy Header2 17 Resliced Image2
stepBox1-2 box1 4 Anatomy Header2 18 Resliced Header2
stepBox1-2 box1 9 Reference Image 17 Resliced Image2
stepBox1-2 box1 9 Reference Image 18 Resliced Header2
stepBox1-3 box1 10 Reference Header 19 Resliced Image3
stepBox1-3 box1 10 Reference Header 20 Resliced Header3
stepBox1-3 box1 5 Anatomy Image3 19 Resliced Image3
stepBox1-3 box1 5 Anatomy Image3 20 Resliced Header3
stepBox1-3 box1 6 Anatomy Header3 19 Resliced Image3
stepBox1-3 box1 6 Anatomy Header3 20 Resliced Header3
stepBox1-3 box1 9 Reference Image 19 Resliced Image3
stepBox1-3 box1 9 Reference Image 20 Resliced Header3
stepBox1-4 box1 10 Reference Header 21 Resliced Image4
stepBox1-4 box1 10 Reference Header 22 Resliced Header4
stepBox1-4 box1 7 Anatomy Image4 21 Resliced Image4
stepBox1-4 box1 7 Anatomy Image4 22 Resliced Header4
stepBox1-4 box1 8 Anatomy Header4 21 Resliced Image4
stepBox1-4 box1 8 Anatomy Header4 22 Resliced Header4
stepBox1-4 box1 9 Reference Image 21 Resliced Image4
stepBox1-4 box1 9 Reference Image 22 Resliced Header4
stepBox2-1 box2 23 Atlas Image 28 Atlas X Graphic
stepBox2-1 box2 24 Atlas Header 28 Atlas X Graphic
-
-
- Results obtained in a user friendly format:
-
-
- Note: For security reasons, uBio does not know that only DATAIDs 1, 2, 9 and 10 are used by 1.align_warp.
He thinks that DATAIDs 1 through 10 are necessary. The same can be said for the steps inside box1 and box2.
-
- Query for user “uAdmin”
- Query: As previously with WHERE users = ‘uAdmin’
- Result:
STEP STEPCLASS INPUT INPUTNAME OUTPUT OUTPUTNAME
--------------------------------------------------------------------------
1 align_warp 1 Anatomy Image1 11 Warp Parameters1
1 align_warp 10 Reference Header 11 Warp Parameters1
1 align_warp 2 Anatomy Header1 11 Warp Parameters1
1 align_warp 9 Reference Image 11 Warp Parameters1
10 slicer 23 Atlas Image 25 Atlas X Slice
10 slicer 24 Atlas Header 25 Atlas X Slice
13 convert 25 Atlas X Slice 28 Atlas X Graphic
2 align_warp 10 Reference Header 12 Warp Parameters2
2 align_warp 3 Anatomy Image2 12 Warp Parameters2
2 align_warp 4 Anatomy Header2 12 Warp Parameters2
2 align_warp 9 Reference Image 12 Warp Parameters2
3 align_warp 10 Reference Header 13 Warp Parameters3
3 align_warp 5 Anatomy Image3 13 Warp Parameters3
3 align_warp 6 Anatomy Header3 13 Warp Parameters3
3 align_warp 9 Reference Image 13 Warp Parameters3
4 align_warp 10 Reference Header 14 Warp Parameters4
4 align_warp 7 Anatomy Image4 14 Warp Parameters4
4 align_warp 8 Anatomy Header4 14 Warp Parameters4
4 align_warp 9 Reference Image 14 Warp Parameters4
5 reslice 11 Warp Parameters1 15 Resliced Image1
5 reslice 11 Warp Parameters1 16 Resliced Header1
6 reslice 12 Warp Parameters2 17 Resliced Image2
6 reslice 12 Warp Parameters2 18 Resliced Header2
7 reslice 13 Warp Parameters3 19 Resliced Image3
7 reslice 13 Warp Parameters3 20 Resliced Header3
8 reslice 14 Warp Parameters4 21 Resliced Image4
8 reslice 14 Warp Parameters4 22 Resliced Header4
9 softmean 15 Resliced Image1 23 Atlas Image
9 softmean 15 Resliced Image1 24 Atlas Header
9 softmean 16 Resliced Header1 23 Atlas Image
9 softmean 16 Resliced Header1 24 Atlas Header
9 softmean 17 Resliced Image2 23 Atlas Image
9 softmean 17 Resliced Image2 24 Atlas Header
9 softmean 18 Resliced Header2 23 Atlas Image
9 softmean 18 Resliced Header2 24 Atlas Header
9 softmean 19 Resliced Image3 23 Atlas Image
9 softmean 19 Resliced Image3 24 Atlas Header
9 softmean 20 Resliced Header3 23 Atlas Image
9 softmean 20 Resliced Header3 24 Atlas Header
9 softmean 21 Resliced Image4 23 Atlas Image
9 softmean 21 Resliced Image4 24 Atlas Header
9 softmean 22 Resliced Header4 23 Atlas Image
9 softmean 22 Resliced Header4 24 Atlas Header
-
-
- Note: All the data are obtained (same results than for the very first version of query1).
- New Query: Find everything that caused “Resliced Image1” (ID 15) to be as it is?
SELECT *
FROM uProcess upc
WHERE usr = 'uBlackBox'
START WITH outputName = 'Resliced Image1'
CONNECT BY PRIOR upc.input = upc.output;
-
-
- Result: no rows selected (uBlackBox cannot even see ‘Resliced Image1’)
-
- Query for ‘uBio’
- Query: Idem with ‘uBio’
- Result:
USR STEP STEPCLASS INPUT INPUTNAME OUTPUT OUTPUTNAME TIME
uBio stepBox1-1 box1 1 Anatomy Image1 15 Resliced Image1 8/7/2006
uBio stepBox1-1 box1 10 Reference Header 15 Resliced Image1 8/7/2006
uBio stepBox1-1 box1 2 Anatomy Header1 15 Resliced Image1 8/7/2006
uBio stepBox1-1 box1 9 Reference Image 15 Resliced Image1 8/7/2006
-
-
- Results obtained in a user friendly format:
-
- Query for ‘uAdmin’
- Query: Idem with ‘uAdmin’
- Result:
USR STEP STEPCLASS INPUT INPUTNAME OUTPUT OUTPUTNAME TIME
uAdmin 5 reslice 11 Warp Parameters1 15 Resliced Image1 8/12/2006
uAdmin 1 align_warp 1 Anatomy Image1 11 Warp Parameters1 8/7/2006
uAdmin 1 align_warp 10 Reference Header 11 Warp Parameters1 8/7/2006
uAdmin 1 align_warp 2 Anatomy Header1 11 Warp Parameters1 8/7/2006
uAdmin 1 align_warp 9 Reference Image 11 Warp Parameters1 8/7/2006
-
-
- Results obtained in a user friendly format:
Note that uAdmin can see that data id 11 (Warp Params 1) is part of the provenance while uBio cannot see it.
Categorisation of queries
Generally speaking, our provenance queries can be categorized as such:
- step vs data provenance: depending on whether the user is interested in knowing the step(s) or the data that have been used to produce a given data item.
- immediate vs deep provenance: depending on whether the user is interested in knowing only the previous step (and possibly the data used by this step) or the whole sequence of steps (and possibly the data used by these steps) used to produced a given data item.
According to our classification, all the queries proposed in this challenge are deep provenance queries, they are both step and data provenance queries.
Live systems
Our implementation is not presently available to the public at large. However, we will be happy to share the code with those who are interested in seeing it. Please send us an email.
Further Comments
Conclusions
We have shown how our model is able to represent the challenge workflow and to answer the proposed queries. We have also shown how the notion of a user view can allow the user to manage the complexity of a workflow through higher levels of abstraction and layering.
We would be interested in discussing with the workshop attendees about three points: (i) the meaning of a stage in a workflow (and in particular, how to interpret this notion with respect to user views), (ii) the biological significance of the procedures that can compose a workflow (in the workflow provided, what are the “significant” steps for a scientist?) and (iii) the (general) problem of computing a difference between two workflows (cf. query 7).
Aknowledgments
This work is supported by NSF grants 0513778, 0415810, and 0612177*.
(* Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.)
--
SarahCohenBoulakia - 11 Sep 2006
- tablesData-1.jpg:
- tablesData-2.jpg:
- tablesData-3.jpg:
to top