
Provenance Challenge



Provenance Challenge Template

Submission in progress

Participating Team

Workflow Representation

Our workflow is represented in Ptolemy II's MoML language and was created in Kepler.

pc.png Figure 1. The Kepler workflow for the challenge

The white and green boxes on screen are actors, which execute a computation step.
An actor has input and output ports on which it receives and sends tokens.
Data is embedded in tokens, which are propagated through the workflow.
SDF Director
The computation model of this workflow is SDF (Synchronous Data Flow). This means sequential execution with a precomputed schedule (the simplest computation model in Kepler).

The implementation is a mock workflow: each actor is implemented in Java, takes its input file names, determines the index (from 1 to 4) and generates its output file names. There is no real execution behind the workflow.
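As an illustration only, a mock actor of this kind can be sketched in a few lines. The sketch is in Python rather than the Java used for the real actors, and the function names (`file_index`, `mock_align_warp`) and the file names (`anatomy3.img`, `warp3.warp`) are hypothetical, not taken from the workflow.

```python
import re


def file_index(path):
    """Return the number that precedes the file extension, e.g. 'anatomy3.img' -> 3."""
    match = re.search(r"(\d+)\.[^./]+$", path)
    if match is None:
        raise ValueError(f"no index found in file name: {path}")
    return int(match.group(1))


def mock_align_warp(image_path, output_dir):
    """Mock actor: performs no warping, it only derives the output file name."""
    index = file_index(image_path)
    # The output naming scheme is an assumption made for this sketch.
    return f"{output_dir}/warp{index}.warp"


print(mock_align_warp("data/anatomy3.img", "data/output"))
```

Because only names are passed along, the provenance trace of such a mock run has the same shape as that of a real run, while the run itself takes almost no time.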

There is another mock implementation, in which each actor is a nested sub-workflow built from Kepler's basic actors. That workflow looks the same at the top level. Although its provenance record is much larger, the answer to the question is the same.

Provenance Trace

According to the RWS model, r(ead), w(rite) and s(tate-reset) events are recorded. In addition, we need to record the actors, their ports, the "tokens" flowing among the actors in the workflow, the created objects and their values.

The RWS prototype inference engine is implemented in Prolog ;-), and the provenance data is currently simply printed out as a set of Prolog facts, but it will be stored in a relational database in the future.

Tables (as Prolog predicates)
portTable Port - Actor relationship, and the type of the port/actor (atomic vs. composite, recorded as a or c)
     portTable('.pc.align_warp2.GetIndexFromName.StringIndexOf.output', '.pc.align_warp2.GetIndexFromName.StringIndexOf', a).
     portTable('.pc.convert_x.atlas_gfx', '.pc.convert_x', c).

tokenTable Token - Object relationship. The object carries the data and it has a unique ID.

     tokenTable('.pc.convert_x.atlas_gfx.0.0', o596169037_35902573).

objectTable Object - Value relationship. Currently no type is recorded.

     objectTable(o596169037_35902573, '"/usr/home/pnorbert/Provenance/ProvCh/data/output/atlas-x.gif"', notype).

traceTable The RWS event trace: an entry is recorded whenever a port reads or writes a token, or an actor has a state reset

     traceTable('.pc.convert_x', s, 'nil', 1).                                   % state reset of actor
     traceTable('.pc.convert_x.input', r, '.pc.slicer_x.atlas_pgm.0.0', 1).      % read a token on port 'input'
     traceTable('.pc.convert_x.atlas_gfx', w, '.pc.convert_x.atlas_gfx.0.0', 1). % write a token on port 'atlas_gfx'
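
To show how such a trace can yield direct token dependencies, here is a minimal Python sketch (not the team's Prolog engine). It assumes one common reading of the RWS model: a written token depends on every token its actor has read since that actor's last state reset. The port-to-actor mapping for `.pc.convert_x.input` is also assumed from the port name, and the last field of each event is carried along without interpretation.

```python
# Events shaped like traceTable(Subject, Kind, Token, N).
TRACE = [
    (".pc.convert_x", "s", None, 1),
    (".pc.convert_x.input", "r", ".pc.slicer_x.atlas_pgm.0.0", 1),
    (".pc.convert_x.atlas_gfx", "w", ".pc.convert_x.atlas_gfx.0.0", 1),
]

# portTable reduced to a port -> actor mapping (assumed for this sketch).
PORT_ACTOR = {
    ".pc.convert_x.input": ".pc.convert_x",
    ".pc.convert_x.atlas_gfx": ".pc.convert_x",
}


def direct_dependencies(trace, port_actor):
    """Map each written token to the tokens its actor read since the last state reset."""
    reads_since_reset = {}  # actor -> tokens read since its last reset
    depends_on = {}         # written token -> list of tokens it depends on
    for subject, kind, token, _ in trace:
        if kind == "s":      # subject is an actor: forget its earlier reads
            reads_since_reset[subject] = []
        elif kind == "r":    # subject is a port: remember the token for its actor
            actor = port_actor[subject]
            reads_since_reset.setdefault(actor, []).append(token)
        elif kind == "w":    # subject is a port: the new token depends on the reads so far
            actor = port_actor[subject]
            depends_on[token] = list(reads_since_reset.get(actor, []))
    return depends_on


print(direct_dependencies(TRACE, PORT_ACTOR))
```

For the three example events, the written token `.pc.convert_x.atlas_gfx.0.0` ends up depending on the single token read on port `input`.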

Trace size of the flat challenge workflow:
Table        Lines
portTable    81
tokenTable   30
objectTable  30
traceTable   86

trace-pc.txt: The provenance trace of the original workflow

Provenance Queries Matrix

Teams     Queries
          Q1         Q2     Q3     Q4     Q5     Q6     Q7     Q8     Q9
RWS team  thumbs up  smile  smile  frown  frown  frown  frown  frown  frown

Please note that this work focuses on single runs only. Data provenance of multiple runs and workflow provenance are addressed by others in the Kepler community, and eventually all of this work is expected to converge toward a unified provenance framework. Questions 2 and 3 are answerable; we simply ran out of time to construct the appropriate Prolog queries for them.

Provenance Queries

The inference engine prototype is implemented in Prolog. The basic information we need is the token lineage of a given token, i.e. all tokens in the workflow on which the given token depended. Values, objects and actors can then be looked up from the provenance tables.

The predicate tokenLineageOfValue( Value, List ) provides the list of all tokens on which the given value (more accurately, the token that first contained this value) depended. This predicate essentially generates the transitive closure of the direct dependency among tokens, back to the very first tokens, i.e. to the inputs.
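
The transitive closure can be illustrated with a short Python sketch. This is not the Prolog predicate itself: the function name `token_lineage`, the `DEPENDS_ON` mapping and all token names other than the two `.pc.` tokens are hypothetical. Unlike the Prolog predicate, the sketch also skips tokens it has already visited, so each token appears once.

```python
# Direct dependencies: token -> tokens it was computed from (hypothetical data).
DEPENDS_ON = {
    ".pc.convert_x.atlas_gfx.0.0": [".pc.slicer_x.atlas_pgm.0.0"],
    ".pc.slicer_x.atlas_pgm.0.0": ["softmean_out"],
    "softmean_out": ["reslice1_out", "reslice2_out"],
}


def token_lineage(token, depends_on):
    """Return every token the given token depends on, directly or indirectly."""
    lineage = []
    seen = set()
    stack = list(depends_on.get(token, []))
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # already reached through another path
        seen.add(current)
        lineage.append(current)
        # Tokens without an entry are workflow inputs: the walk stops there.
        stack.extend(depends_on.get(current, []))
    return lineage


print(token_lineage(".pc.convert_x.atlas_gfx.0.0", DEPENDS_ON))
```

The walk ends at tokens that have no recorded dependencies, which correspond to the inputs of the workflow.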

valueLineageOfValue/2 and actorLineageOfValue/2 use the above predicate and then look up, for each token, the value it contains or the actor that generated it, respectively.

Since the dependency graph may contain several paths back to a certain token, and several tokens can be created by the same actor, an actor or value may appear in the list several times. Therefore, we use the list_to_set/2 built-in predicate to remove the duplicates from the result.
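
The lookup and de-duplication step can be sketched as follows, again in Python and only as an illustration. The helper `list_to_set` here imitates the Prolog built-in by keeping the first occurrence of each element in order; `actor_lineage`, the `WRITTEN_ON` mapping and the `.pc.reslice` names are hypothetical.

```python
def list_to_set(items):
    """Remove duplicates, keeping the first occurrence of each element in order."""
    return list(dict.fromkeys(items))


def actor_lineage(tokens, written_on, port_actor):
    """Map each token to the actor that wrote it, then drop repeated actors."""
    # Tokens with no recorded writer (workflow inputs) are skipped.
    actors = [port_actor[written_on[t]] for t in tokens if t in written_on]
    return list_to_set(actors)


# A token lineage in which one token is reached twice (hypothetical data).
TOKENS = [
    ".pc.slicer_x.atlas_pgm.0.0",
    ".pc.reslice.out.0.0",
    ".pc.reslice.out.1.0",
    ".pc.slicer_x.atlas_pgm.0.0",
]
# Token -> port it was written on, and port -> owning actor.
WRITTEN_ON = {
    ".pc.slicer_x.atlas_pgm.0.0": ".pc.slicer_x.atlas_pgm",
    ".pc.reslice.out.0.0": ".pc.reslice.out",
    ".pc.reslice.out.1.0": ".pc.reslice.out",
}
PORT_ACTOR = {
    ".pc.slicer_x.atlas_pgm": ".pc.slicer_x",
    ".pc.reslice.out": ".pc.reslice",
}

print(actor_lineage(TOKENS, WRITTEN_ON, PORT_ACTOR))
```

Both sources of repetition named above show up in the sample data: one token appears twice in the lineage, and two different tokens were written by the same actor.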

1. Find the process that led to Atlas X Graphic / everything that caused Atlas X Graphic to be as it is. This should tell us the new brain images from which the averaged atlas was generated, the warping performed etc.

Input: the output file name as "/usr/home/pnorbert/Provenance/ProvCh/data/output/atlas-x.gif"

Output: a set of actors that contributed and data values (file names) that led to this file

Answer 1.a. List of actors that contributed to the result (21 actors).

They appear in reverse order of execution.

?- q1_actors('"/usr/home/pnorbert/Provenance/ProvCh/data/output/atlas-x.gif"', ActorList), print(ActorList).

Note: line breaks were entered manually in the doc for easier reading.

Answer 1.b. List of input and intermediate values created by the workflow (26 values).

?- q1_values('"/usr/home/pnorbert/Provenance/ProvCh/data/output/atlas-x.gif"', ValueList), print(ValueList).

Suggested Workflow Variants

Like the myGrid/Taverna team, we also created a more generic version of the challenge workflow, which works for any number of input images, provided that Softmean can take any number of input images at once.

pca.png Figure 2. The generalized Kepler workflow for the challenge

The difference between this and the Taverna workflow is that we create the input file names within the workflow one by one. A list of input files would arrive as a single token at the first AlignWarp actor, which is supposed to receive them one by one. The 4 outputs of Reslice are collected into an array and Softmean is executed only once. The output of Softmean is repeated 3 times with different slice parameters (generated by the seqXYZ actor), thus executing the final two operations three times.

The answers to the first query are essentially the same, with some changes. The actor list is shorter, reporting e.g. Reslice once instead of Reslice1,...,Reslice4; however, the additional array and repeat operations appear in the list. The value list becomes larger because of the additional array and repetition tokens.

Suggested Queries

Categorisation of queries

Live systems

Further Comments


-- NorbertPodhorszki - 07 Sep 2006


Attachment    Size    Date                 Who                Comment
pc.png        70.0 K  07 Sep 2006 - 16:41  NorbertPodhorszki  Kepler workflow, flat version
pca.png       47.5 K  07 Sep 2006 - 17:12  NorbertPodhorszki  Generalized Kepler workflow
trace-pc.txt  15.4 K  07 Sep 2006 - 17:42  NorbertPodhorszki  The provenance trace of the original workflow


Copyright © 1999-2012 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.