Data and Knowledge Systems
Participating Team
- Participant names: Shawn Bowers, Tim McPhillips, and Bertram Ludaescher, in collaboration with Norbert Podhorszki and Ilkay Altintas.
- Project Overview: The Data and Knowledge Systems (DAKS) group at UC Davis is developing the Collection-Oriented Workflow paradigm and implementing this approach in the Kepler Workflow System. See McPhillips & Bowers (2005) and McPhillips et al (2006) listed below for more information.
- Provenance-specific Overview: Among other benefits, collection-oriented workflows enable comprehensive data and process lineage information to be recorded and passed through the workflow along with data. We demonstrate this capability in this challenge. Our approach is an adaptation of the RWS provenance model described in Bowers et al (2006) listed below. Our approach takes advantage of the collection-oriented workflow framework to:
- Automatically infer state-reset events based on the declared scope of actors.
- Minimize the number of provenance-relevant events that must be recorded.
- Simplify association of workflow runs with data provenance by storing workflow inputs, outputs, and dependency information in a single, self-contained trace file.
- Support science-oriented provenance queries, emphasizing data dependencies (lineage) as well as process details.
- Decouple provenance representation from particular scientific workflow technologies (e.g., Kepler).
- Relevant Publications:
- An Approach for Pipelining Nested Collections in Scientific Workflows, Timothy McPhillips and Shawn Bowers, SIGMOD Record 34, 12-17, 2005.
- A Model for User-Oriented Data Provenance in Pipelined Scientific Workflows, Shawn Bowers, Timothy McPhillips, Bertram Ludaescher, Shirley Cohen, Susan B. Davidson. International Provenance and Annotation Workshop (IPAW'06), 2006.
- Collection-Oriented Scientific Workflows for Integrating and Analyzing Biological Data, Timothy McPhillips, Shawn Bowers, Bertram Ludaescher. 3rd International Workshop on Data Integration in the Life Sciences (DILS'06), 2006.
Workflow Representation
Kepler implementation of the Challenge Workflow. We implemented the Challenge workflow in Kepler as shown below. Actors labeled AlignWarp, ResliceWarp, SoftMean, Slicer, and Convert correspond to the five stages of the Challenge workflow. The actors labeled CollectionReader and CollectionWriter import data into the workflow and save the output/trace of the workflow, respectively. The actor ReplicationCollection creates two additional copies of the products of SoftMean so that downstream actors will execute three times, once for each desired slice of the average image.
How the collection-oriented actors (coactors) work. Our collection-oriented workflow framework provides generic support for operating over nested collections (i.e., trees) of scientific data. Coactors differ from conventional Kepler actors (such as those used in the RWS solution to this Provenance Challenge) in that, rather than operating on flat, homogeneous streams of tokens, coactors operate on trees of heterogeneous data. A coactor is invoked whenever a subtree of the input stream matching certain criteria (e.g., the declared scope of the coactor) is received. During an invocation, the coactor may optionally add or delete nodes within the subtree upon which it was invoked. The figure below illustrates how the AlignWarp actor operates on an AnatomyImage collection, adding a WarpParamSet to this collection. All data received by AlignWarp outside of its scope passes through the coactor transparently.
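The scope-driven invocation described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not the Kepler implementation: the `Node` class, the `run_coactor` function, and the `OtherCollection` label are hypothetical, and scope matching is reduced to comparing a collection label.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Node:
    """A node in a nested collection; nodes without children are data items."""
    label: str
    children: List["Node"] = field(default_factory=list)


def run_coactor(node: Node, scope: str, action: Callable[[Node], None]) -> int:
    """Invoke `action` once for each subtree whose label equals `scope`.

    Everything outside a matching subtree is left untouched, which mirrors
    the transparent pass-through of out-of-scope data. Returns the number
    of invocations performed.
    """
    if node.label == scope:
        action(node)
        return 1
    return sum(run_coactor(child, scope, action) for child in node.children)


def align_warp(collection: Node) -> None:
    # Hypothetical stand-in for AlignWarp: add a WarpParamSet to the collection.
    collection.children.append(Node("WarpParamSet"))


images = Node("ImageCollection", [
    Node("AnatomyImage", [Node("Image"), Node("ImageHeader")]),
    Node("AnatomyImage", [Node("Image"), Node("ImageHeader")]),
    Node("OtherCollection", [Node("Image")]),  # outside the coactor's scope
])
invocations = run_coactor(images, "AnatomyImage", align_warp)
```

Here the coactor fires twice, once per AnatomyImage collection, and the out-of-scope collection keeps its original contents.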
In Kepler, collections are serialized and streamed through coactors. Because actor execution is pipelined based on each actor’s scope, this approach enables concurrent processing of nested data collections as shown below. The figure illustrates how delimiter tokens (in blue and green) are used to bracket nested collections of associated data (in white), metadata (in red) and actor parameters (not shown).
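As a rough illustration of the delimiter-token idea, the following Python sketch flattens a nested collection into a token stream and rebuilds it. The token names (`open`, `data`, `close`) and the tuple representation are assumptions made for this example; the actual Kepler token classes are not described here.

```python
def serialize(item):
    """Yield a flat token stream for a nested collection.

    A collection is a (label, children) tuple; a data item is a string.
    Opening and closing delimiter tokens bracket each collection.
    """
    if isinstance(item, str):
        yield ("data", item)
    else:
        label, children = item
        yield ("open", label)
        for child in children:
            yield from serialize(child)
        yield ("close", label)


def deserialize(tokens):
    """Rebuild the nested structure from a well-formed token stream."""
    stack = [("root", [])]
    for kind, value in tokens:
        if kind == "open":
            stack.append((value, []))
        elif kind == "data":
            stack[-1][1].append(value)
        else:  # a closing delimiter completes the innermost open collection
            finished = stack.pop()
            stack[-1][1].append(finished)
    return stack[0][1][0]


collection = ("ImageCollection", [
    ("AnatomyImage", ["image1", "header1"]),
    ("AnatomyImage", ["image2", "header2"]),
])
tokens = list(serialize(collection))
rebuilt = deserialize(tokens)
```

Because each collection is complete as soon as its closing delimiter arrives, a downstream actor scoped to AnatomyImage could in principle start on the first collection while the second is still being produced upstream.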
Input collections drive workflow execution. The collection-oriented implementation of the Challenge workflow may be configured to operate on different numbers of input anatomy images, not by modifying the workflow definition, but by customizing the input to the workflow. We tested our provenance system using two different input data sets represented by two XML files. The first input file, input1.xml, corresponds exactly to the Challenge workflow and contains four AnatomyImage collections within a single ImageCollection collection (see tree representation below). The second input file, input2.xml, contains three ImageCollections comprising four, three, and two AnatomyImage collections respectively. In other words, our implementation can operate on varying numbers of anatomy images within a single run of the workflow. Moreover, parameter values for particular actors also may be embedded within the workflow input to override default parameter values for particular sub-collections of data (note Parameter elements in the two input XML files).
Provenance Trace
The results of each run of the Challenge workflow (including input data, intermediate and final data products, as well as provenance) were recorded in a trace file by the CollectionWriter actor and may be downloaded here: trace1.xml, trace2.xml. Trace files are implemented in XML using the same schema used for workflow input files read by CollectionReader. (An execution of a collection-oriented workflow may be thought of as a process of incrementally elaborating the input XML document.) The figure below shows the beginning of such a trace, highlighting how little additional information must be added to the trace file to record data lineage and invocation dependencies.
As illustrated above, data and invocation dependencies are represented in the trace as special XML elements describing the provenance of other elements. Insertion and deletion elements record the actor, actor invocation count, and direct data dependencies associated with the event that created or removed the element following it in the document. InvocationDependency elements record which invocations of preceding actors created data or modified collections used in the current actor invocation. Insertion, deletion, and invocation dependency information is passed through the workflow as special tokens during workflow execution. Coactors declare data dependencies explicitly during execution, whereas invocation dependencies are inferred and inserted into the token stream by the framework automatically. The figure below illustrates two data dependencies graphically.
From collection-oriented execution traces we can construct data-lineage graphs. Vertices in a data-lineage graph represent input, output, and intermediate data and collection items. Edges denote item dependencies, which are further labeled with the actor invocations involved in the creation or modification of the item. In general, collection-oriented traces, and their corresponding data-lineage graphs, enable a wide range of queries over both process and data dependencies.
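To make the construction concrete, here is a minimal Python sketch that turns insertion records into labeled lineage edges. The `Insertion` record, its field names, and the item identifiers are hypothetical simplifications of the trace elements described above; the edge layout (item, dependency, actor, invocation) is an assumption chosen to resemble the four-element edge terms matched in the Prolog queries later on this page.

```python
from dataclasses import dataclass
from typing import List, Set, Tuple

Edge = Tuple[str, str, str, int]  # (item, depends_on, actor, invocation)


@dataclass
class Insertion:
    """Simplified insertion event: which invocation created an item, and from what."""
    item: str
    actor: str
    invocation: int
    depends_on: List[str]


def build_edges(insertions: List[Insertion]) -> List[Edge]:
    """Emit one labeled edge per direct data dependency."""
    return [
        (ins.item, dep, ins.actor, ins.invocation)
        for ins in insertions
        for dep in ins.depends_on
    ]


def vertices(edges: List[Edge]) -> Set[str]:
    """Every item that appears at either end of an edge."""
    return {e[0] for e in edges} | {e[1] for e in edges}


# Toy trace with made-up item identifiers.
trace = [
    Insertion("warp1", "AlignWarp", 1, ["image1", "header1"]),
    Insertion("resliced1", "ResliceWarp", 1, ["warp1", "image1"]),
]
edges = build_edges(trace)
nodes = vertices(edges)
```

Only direct dependencies are stored; indirect lineage is recovered later by following edges.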
Provenance Queries
We have implemented a prototype system for querying collection-oriented execution traces. The system is written in Prolog and can manage and query multiple execution traces. The system provides a number of primitive operations for accessing and querying execution traces, some of which are demonstrated below.
Core Provenance Queries
1. Find the process that led to Atlas X Graphic / everything that caused Atlas X Graphic to be as it is. This should tell us the new brain images from which the averaged atlas was generated, the warping performed, etc. To answer this query, we return the subset of edges of the data-lineage graph that correspond to paths beginning with the desired Atlas X Graphic data node. Given the Atlas X Graphic data node (AtlasXGraphic) and the trace (Trace), the following expression gives the corresponding edges of the data-lineage graph.
lineageEdges(Trace, [AtlasXGraphic], Edges)
The lineageEdges predicate, a primitive query operator provided by our system, computes the set of edges that define paths starting from each of the given set of nodes. The following query (1) obtains the first trace (with the trace id '1'), (2) obtains the desired Atlas X Graphic output node of the trace (with the node id of '341'), (3) computes the corresponding portion of the data-lineage graph, and (4) draws the resulting graph edges.
?- traceId('1', Trace),
nodeForId(Trace, '341', Node),
lineageEdges(Trace, [Node], Edges),
drawTraceEdges(Edges, 'pq1', gif).
Each of the predicates above is implemented as a primitive operation within the provenance query engine. The graph that results from running this query is shown below.
The following query computes and draws the corresponding data-lineage graph for the second trace, described above. Since there are three separate image collections used in the execution (i.e., the three collections are pipelined through the workflow), the result consists of three independent Atlas X Graphic objects.
?- traceId('2', Trace),
nodeForId(Trace, '973', Node1),
nodeForId(Trace, '1093', Node2),
nodeForId(Trace, '1193', Node3),
lineageEdges(Trace, [Node3, Node2, Node1], Edges),
drawTraceEdges(Edges, 'pq1_trace2', gif).
The graph that results from running this query is shown below. Note that in this example, only a subset of the input images is used to derive each of the corresponding output graphics.
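The behaviour of lineageEdges can be approximated by a simple graph traversal. The Python sketch below is not the Prolog implementation; it assumes edges are (item, dependency, actor, invocation) tuples and uses made-up item identifiers. It shows why an output reaches only the inputs of its own collection.

```python
from typing import Dict, List, Set, Tuple

Edge = Tuple[str, str, str, int]  # (item, depends_on, actor, invocation)


def lineage_edges(edges: List[Edge], start: List[str]) -> Set[Edge]:
    """Collect every edge on a dependency path that begins at a start node."""
    by_item: Dict[str, List[Edge]] = {}
    for edge in edges:
        by_item.setdefault(edge[0], []).append(edge)

    seen: Set[Edge] = set()
    frontier = list(start)
    while frontier:
        item = frontier.pop()
        for edge in by_item.get(item, []):
            if edge not in seen:
                seen.add(edge)
                frontier.append(edge[1])  # continue from the dependency
    return seen


# Two independent collections, A and B, pipelined through the same actors.
edges = [
    ("atlasA", "meanA", "Slicer", 1),
    ("meanA", "imgA1", "SoftMean", 1),
    ("meanA", "imgA2", "SoftMean", 1),
    ("atlasB", "meanB", "Slicer", 2),
    ("meanB", "imgB1", "SoftMean", 2),
]
result = lineage_edges(edges, ["atlasA"])
# Items that are depended upon but never derived are the workflow inputs reached.
inputs_reached = {e[1] for e in result} - {e[0] for e in result}
```

Starting from `atlasA`, the traversal never touches the edges of collection B, so `imgB1` does not appear in the result.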
2. Find the process that led to Atlas X Graphic, excluding everything prior to the averaging of images with softmean. This query is similar to query 1 above, but we filter out edges of the data-lineage graph that correspond to invocations occurring prior to SoftMean computations. The following query computes and draws the corresponding data-lineage graph for the first trace.
?- traceId('1', Trace),
nodeForId(Trace, '341', Node),
lineageEdges(Trace, [Node], Edges),
filterBeforeActor(Trace, Edges, 'SoftMean', FilteredEdges),
drawTraceEdges(FilteredEdges, 'pq2', gif).
The filtering step is performed using the filterBeforeActor operation provided by the query engine. The graph that results from running this query is shown below.
3. Find the Stage 3, 4 and 5 details of the process that led to Atlas X Graphic. Note that the result of this query is identical to query 2. Here we show an alternative method for computing the result. Instead of filtering out data-lineage edges that correspond to invocations prior to SoftMean, we select those edges denoting invocations after ResliceWarp.
?- traceId('1', Trace),
nodeForId(Trace, '341', Node),
lineageEdges(Trace, [Node], Edges),
selectAfterActor(Trace, Edges, 'ResliceWarp', FilteredEdges),
drawTraceEdges(FilteredEdges, 'pq3', gif).
The selection step is performed using the selectAfterActor operation provided by the query engine. The graph that results from running this query is identical to the one for query 2.
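The equivalence of queries 2 and 3 can be illustrated with a small Python sketch. How the query engine decides that one actor precedes another is not specified here, so this example simply assumes a fixed linear stage order (`ACTOR_ORDER`); the function names mirror filterBeforeActor and selectAfterActor but are hypothetical reimplementations, and the edge identifiers are invented.

```python
# Assumed linear order of the five Challenge stages.
ACTOR_ORDER = ["AlignWarp", "ResliceWarp", "SoftMean", "Slicer", "Convert"]


def filter_before_actor(edges, actor, order=ACTOR_ORDER):
    """Drop edges whose actor runs before `actor`; keep `actor` and later stages."""
    cutoff = order.index(actor)
    return [e for e in edges if order.index(e[2]) >= cutoff]


def select_after_actor(edges, actor, order=ACTOR_ORDER):
    """Keep only edges whose actor runs strictly after `actor`."""
    cutoff = order.index(actor)
    return [e for e in edges if order.index(e[2]) > cutoff]


# Toy lineage: one edge per stage, as (item, depends_on, actor, invocation).
edges = [
    ("warp", "image", "AlignWarp", 1),
    ("resliced", "warp", "ResliceWarp", 1),
    ("mean", "resliced", "SoftMean", 1),
    ("slice", "mean", "Slicer", 1),
    ("graphic", "slice", "Convert", 1),
]
query2 = filter_before_actor(edges, "SoftMean")
query3 = select_after_actor(edges, "ResliceWarp")
```

Under this ordering both calls return the SoftMean, Slicer, and Convert edges, which is why the two queries produce the same graph.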
4. Find all invocations of procedure align_warp using a twelfth order nonlinear 1365 parameter model (see model menu describing possible values of parameter "-m 12" of align_warp) that ran on a Monday. The following query returns the set of invocations of AlignWarp having the given parameter.
?- traceId(TraceId, Trace),
traceInvocParam(Trace, 'warpParameters', '-m 12', 'AlignWarp', Invoc).
The query uses the traceInvocParam primitive operation. This operation uses embedded parameter tokens in the trace (i.e., input stream) to reconstruct the parameters applied to particular actor invocations. Note that our prototype does not currently assign metadata to traces, so the query does not test the day on which an invocation ran; such metadata would be simple to add. The result of running this query on our two traces is:
TRACE = 1 ACTOR = AlignWarp INVOCATION = 1
TRACE = 1 ACTOR = AlignWarp INVOCATION = 2
TRACE = 1 ACTOR = AlignWarp INVOCATION = 3
TRACE = 1 ACTOR = AlignWarp INVOCATION = 4
TRACE = 2 ACTOR = AlignWarp INVOCATION = 5
TRACE = 2 ACTOR = AlignWarp INVOCATION = 6
TRACE = 2 ACTOR = AlignWarp INVOCATION = 7
TRACE = 2 ACTOR = AlignWarp INVOCATION = 8
Note that only two of the image collections of the second trace use the given parameter. In addition, the parameter is used for only two of the anatomy images in one of the input image collections.
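One way to picture how embedded parameter tokens determine the parameters of each invocation is the Python sketch below. It assumes that a parameter applies to the collection it is embedded in and to everything nested inside, with the innermost setting winning; the dictionary layout, the `effective_params` function, and the default value `-m 6` are all hypothetical.

```python
from typing import Dict, List, Optional


def effective_params(node: dict, inherited: Dict[str, str], scope: str,
                     out: Optional[List[Dict[str, str]]] = None) -> List[Dict[str, str]]:
    """Return the parameters in force for each `scope` collection, in document order."""
    if out is None:
        out = []
    # Parameters embedded in this collection override inherited ones.
    current = {**inherited, **node.get("params", {})}
    if node["label"] == scope:
        out.append(current)
    else:
        for child in node.get("children", []):
            effective_params(child, current, scope, out)
    return out


defaults = {"warpParameters": "-m 6"}  # hypothetical actor default
run = {"label": "Run", "children": [
    {"label": "ImageCollection", "params": {"warpParameters": "-m 12"}, "children": [
        {"label": "AnatomyImage"},
        {"label": "AnatomyImage"},
    ]},
    {"label": "ImageCollection", "children": [
        {"label": "AnatomyImage", "params": {"warpParameters": "-m 12"}},
        {"label": "AnatomyImage"},
    ]},
]}
per_invocation = effective_params(run, defaults, "AnatomyImage")
# Invocations are numbered from 1 in the order the collections are encountered.
matching = [i + 1 for i, p in enumerate(per_invocation)
            if p["warpParameters"] == "-m 12"]
```

In this toy input the first collection sets the parameter for all of its images, while the second sets it for a single image only, so three of the four invocations match.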
5. Find all Atlas Graphic images output from workflows where at least one of the input Anatomy Headers had an entry global maximum=4095. The contents of a header file can be extracted as text using the scanheader AIR utility. The following query (1) selects input nodes of the trace, (2) checks that the input node is of type ImageHeader, (3) checks that the header meets the given criteria, and (4) obtains the output nodes of type AtlasGraphic of the trace.
?- traceId(TraceId, Trace),
traceInputNode(Trace, X),
nodeType(X, 'ImageHeader'),
headerQuery(X),
traceOutputNode(Trace, AtlasGraphic),
nodeType(AtlasGraphic, 'AtlasGraphic').
The traceInputNode, nodeType, and traceOutputNode predicates are primitives of the query engine. Here, we assume that headerQuery is a user-supplied predicate that applies the global maximum check. Note that we did not add the capability of calling external applications to our current prototype. We envision the ability to call such external functions as part of a broader data management facility (e.g., within Kepler), as opposed to a provenance task. In our prototype, we wrote headerQuery to succeed for one header from each trace. The result of running this query on both traces is:
TRACE = 1 TYPE = AtlasGraphic TOKEN = 341 OBJECT = 68
TRACE = 1 TYPE = AtlasGraphic TOKEN = 349 OBJECT = 70
TRACE = 1 TYPE = AtlasGraphic TOKEN = 357 OBJECT = 72
TRACE = 2 TYPE = AtlasGraphic TOKEN = 1093 OBJECT = 225
TRACE = 2 TYPE = AtlasGraphic TOKEN = 1101 OBJECT = 227
TRACE = 2 TYPE = AtlasGraphic TOKEN = 1109 OBJECT = 229
TRACE = 2 TYPE = AtlasGraphic TOKEN = 1193 OBJECT = 242
TRACE = 2 TYPE = AtlasGraphic TOKEN = 1202 OBJECT = 244
TRACE = 2 TYPE = AtlasGraphic TOKEN = 1210 OBJECT = 246
TRACE = 2 TYPE = AtlasGraphic TOKEN = 973 OBJECT = 199
TRACE = 2 TYPE = AtlasGraphic TOKEN = 981 OBJECT = 201
TRACE = 2 TYPE = AtlasGraphic TOKEN = 989 OBJECT = 203
We note that the particular wording of this query assumes that all output graphics depend on all input images and headers. For our second trace, one can easily verify that this assumption is incorrect. Alternatively, it is possible to rewrite this query using primitive operations of the query engine so that only proper derivations are returned.
6. Find all output averaged images of softmean (average) procedures, where the warped images taken as input were align_warped using a twelfth order nonlinear 1365 parameter model, i.e. "where softmean was preceded in the workflow, directly or indirectly, by an align_warp procedure with argument -m 12." The following query (1) obtains output averaged images of SoftMean invocations, (2) obtains the set of lineage edges leading from the averaged images, and (3) ensures that at least one edge corresponds to an AlignWarp invocation with the appropriate parameter model.
?- traceId(TraceId, Trace),
actorInvocation(Trace, 'SoftMean', _, _, AveragedImage),
nodeType(AveragedImage, 'Image'),
lineageEdges(Trace, [AveragedImage], Edges),
member((_, _, 'AlignWarp', Invoc), Edges),
traceInvocParam(Trace, 'warpParameters', '-m 12', 'AlignWarp', Invoc).
This query uses the actorInvocation operation, which returns input nodes, output nodes, and invocation counts for a given actor within a trace. The result of running this query over both traces is:
TRACE = 1 TYPE = Image TOKEN = 311 OBJECT = 65
TRACE = 2 TYPE = Image TOKEN = 1065 OBJECT = 222
TRACE = 2 TYPE = Image TOKEN = 1165 OBJECT = 239
7. A user has run the workflow twice, in the second instance replacing each procedure (convert) in the final stage with two procedures: pgmtoppm, then pnmtojpeg. Find the differences between the two workflow runs. The exact level of detail in the difference that is detected by a system is up to each participant. We left this query open due to the ambiguity of what a proper result should be. It is not clear, at least for this example, what type of result would be useful for a user (i.e., a scientist). For example, to gauge the difference between workflow executions, one may simply want to perform a "diff" on run outputs (which is partially supported in our current prototype for underlying data objects). Alternatively, we can imagine that users may want to compare particular data derivation paths across the runs (which is again possible at the object level within our system), or compare different actor invocation patterns.
8. A user has annotated some anatomy images with a key-value pair center=UChicago. Find the outputs of align_warp where the inputs are annotated with center=UChicago. The following query leverages collection-oriented metadata to select the anatomy images with the given key-value pair. The query (1) obtains invocations of AlignWarp, (2) obtains input image nodes and corresponding output nodes of AlignWarp invocations, and (3) checks that the given input image has the appropriate metadata.
?- traceId(TraceId, Trace),
traceInvoc(Trace, 'AlignWarp', Invoc),
actorInvocation(Trace, 'AlignWarp', Invoc, InputNode, OutputNode),
nodeType(InputNode, 'Image'),
nodeMetadata(Trace, 'center', 'UChicago', InputNode).
This query uses the nodeMetadata primitive operation to check that the given node has the correct key-value pair metadata. The result of running this query on the two traces is:
TRACE = 1 ACTOR = AlignWarp INVOCATION = 1 TYPE = WarpParamSet TOKEN = 245 OBJECT = 53
TRACE = 2 ACTOR = AlignWarp INVOCATION = 1 TYPE = WarpParamSet TOKEN = 851 OBJECT = 176
9. A user has annotated some atlas graphics with key-value pair where the key is studyModality. Find all the graphical atlas sets that have metadata annotation studyModality with values speech, visual or audio, and return all other annotations to these files. For this query, we assume that the user-annotation is given as collection-oriented metadata within the trace (i.e., the metadata is available in the trace as opposed to only being available via a data management subsystem). We also assume that a Graphic Atlas "set" consists of all Atlas Graphics derived from an invocation of Softmean. Thus, the Atlas X, Y, and Z Graphics generated from Softmean correspond to a single set, and there are three such sets generated by our second example trace.
The following query computes the sets of Atlas Graphics, where at least one Atlas Graphic has the desired metadata annotation.
?- traceId(TraceId, Trace),
traceInvoc(Trace, 'SoftMean', Invoc),
graphicAtlasSet(Trace, Invoc, GraphicSet).
The graphicAtlasSet predicate is defined specifically for this query as follows:
graphicAtlasSet(Trace, Invoc, GraphicSet) :-
setof(G, graphicAtlas(Trace, G, Invoc), GraphicSet),
member(Graphic, GraphicSet),
member(Modality, ['speech', 'visual', 'audio']),
nodeMetadata(Trace, 'studyModality', Modality, Graphic).
graphicAtlas(Trace, AtlasGraphic, SoftMeanInvoc) :-
traceOutputNode(Trace, AtlasGraphic),
nodeType(AtlasGraphic, 'AtlasGraphic'),
lineageEdges(Trace, [AtlasGraphic], Edges),
member((_, _, 'SoftMean', SoftMeanInvoc), Edges).
The complexity of this query is due to the generation of sets of Atlas Graphics, which is similar to performing a group-by operation in SQL, and then further filtering groups by corresponding metadata values. Note that above we do not further return the given metadata values of the graphics (although one easily could). The result of running the above query on the two traces is:
TRACE = 1 TOKEN SET = {341, 349, 357}
TRACE = 2 TOKEN SET = {1093, 1101, 1109}
TRACE = 2 TOKEN SET = {1193, 1202, 1210}
Note that as shown above, only two of the input image collections for the second trace have matching Atlas Graphic sets.
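The group-then-filter shape of this query can be sketched in Python as follows. This is an illustration only: the function name, the input layout (graphic token paired with the SoftMean invocation it derives from), the tokens, and the metadata values are all invented for the example and do not come from the trace files.

```python
from collections import defaultdict


def graphic_atlas_sets(graphics, metadata, modalities):
    """Group graphics by SoftMean invocation, then keep groups with a matching annotation.

    graphics:  list of (graphic_token, softmean_invocation) pairs
    metadata:  {graphic_token: {key: value}}
    """
    groups = defaultdict(set)
    for token, invocation in graphics:
        groups[invocation].add(token)  # the "group by" step
    return {
        invocation: tokens
        for invocation, tokens in groups.items()
        # Keep a group if at least one member carries a wanted modality.
        if any(metadata.get(t, {}).get("studyModality") in modalities for t in tokens)
    }


graphics = [("x1", 1), ("y1", 1), ("z1", 1), ("x2", 2), ("y2", 2), ("z2", 2)]
metadata = {
    "y1": {"studyModality": "speech"},
    "x2": {"studyModality": "motor"},  # present, but not a requested value
}
sets = graphic_atlas_sets(graphics, metadata, {"speech", "visual", "audio"})
```

A single annotated member is enough for its whole set to be returned, while a set whose only annotation has a different value is dropped.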
Suggested Queries
One of the benefits of our approach (and similarly with the RWS approach) is its ability to support various data-lineage queries (as opposed to process-oriented queries). The following two examples demonstrate more "scientist-oriented" queries over data lineage.
10. Find all of the intermediate (not input or output) Images used to derive the Atlas X Graphic. (A variant is to find the "closest" Image on the derivation path from the given output.) The following query (1) obtains one of the Atlas X Graphic outputs for the second trace, (2) obtains the lineage edges from the Atlas X Graphic, (3) selects an image that was used to derive the Atlas X Graphic, and (4) checks that the image was not an input to the workflow.
?- traceId('2', Trace),
nodeForId(Trace, '1093', Node),
lineageEdges(Trace, [Node], Edges),
member((_, DepNode, _, _), Edges),
nodeType(DepNode, 'Image'),
setof(N, traceInputNode(Trace, N), InputNodes),
\+ member(DepNode, InputNodes).
The result of running this query is:
TYPE = Image TOKEN = 1001 OBJECT = 205
TYPE = Image TOKEN = 1018 OBJECT = 208
TYPE = Image TOKEN = 1065 OBJECT = 222
11. Find all of the input Images used to derive the Atlas X Graphic. Note that this query is of particular importance for the second trace, where not all input images were used to derive output graphics. The following query (1) obtains an Atlas X Graphic for the second trace, (2) obtains the lineage edges from the Atlas X Graphic, (3) selects an Image that was used to derive the Atlas X Graphic, and (4) checks that the Image was an input to the workflow.
?- traceId('2', Trace),
nodeForId(Trace, '1093', Node),
lineageEdges(Trace, [Node], Edges),
member((_, DepNode, _, _), Edges),
nodeType(DepNode, 'Image'),
setof(D, traceInputNode(Trace, D), InputNodes),
member(DepNode, InputNodes).
The result of running this query is:
TYPE = Image TOKEN = 927 OBJECT = 190
TYPE = Image TOKEN = 932 OBJECT = 192
TYPE = Image TOKEN = 937 OBJECT = 194
TYPE = Image TOKEN = 942 OBJECT = 196
We note that a number of similar types of data-lineage queries are defined in (Bowers et al, 2006).
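The "closest Image" variant mentioned under query 10 is not implemented in the queries above. One plausible reading is a breadth-first search over the lineage edges that stops at the first node of the wanted type. The Python sketch below assumes (item, dependency, actor, invocation) edges; the function name, the item identifiers, and the `AtlasSlice` type name are hypothetical.

```python
from collections import deque


def closest_of_type(edges, node_types, start, wanted):
    """Return (node, distance) for the nearest dependency of type `wanted`, or None."""
    deps = {}
    for item, dep, _actor, _invocation in edges:
        deps.setdefault(item, []).append(dep)

    queue = deque([(start, 0)])
    visited = {start}
    while queue:
        node, distance = queue.popleft()
        if node != start and node_types.get(node) == wanted:
            return node, distance
        for dep in deps.get(node, []):
            if dep not in visited:
                visited.add(dep)
                queue.append((dep, distance + 1))
    return None


edges = [
    ("graphic", "slice", "Convert", 1),
    ("slice", "mean", "Slicer", 1),
    ("mean", "resliced1", "SoftMean", 1),
    ("resliced1", "image1", "ResliceWarp", 1),
]
node_types = {
    "graphic": "AtlasGraphic",
    "slice": "AtlasSlice",
    "mean": "Image",
    "resliced1": "Image",
    "image1": "Image",
}
closest = closest_of_type(edges, node_types, "graphic", "Image")
```

Because the search expands one dependency level at a time, the averaged image is found before the resliced and input images that lie further back on the path.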
Suggested Workflow Variants
Our approach can support a variety of workflow constructs, including pipelining and partial data dependencies (e.g., as illustrated in the second example trace), as well as concurrent actor execution and cyclic workflow graphs (looping and iteration). We have found that workflows in bioinformatics typically exhibit some or all of these features, which we would like to see in future scientific-workflow provenance challenges.