Second Provenance Challenge -- CESNET
Participating Team
- Short team name: CESNET
- Participant names: Frantisek Dvorak, Jiri Filipovic, Ales Krenek, Ludek Matyska, Milos Mulac, Jiri Sitera, Zdenek Sustr
- Project URL: http://egee.cesnet.cz/en/JRA1/
- Reference to first challenge results (if participated): CESNET
Differences from First Challenge
Note here any changes in your provenance representation, workflow enactment or system since the first challenge. Alternatively, if you did not participate in the first challenge, please provide the same details as were required for those who did (particularly workflow representation and provenance representation).
Implicit workflow representation
The CESNET implementation of the First Provenance Challenge relied on an explicit representation of the workflow structure, extracted from the native workflow representation in gLite -- the dependencies among DAG subjobs specified by the user at submission time. These dependencies were decoded, recorded as ancestor and successor attributes of the DAG subjobs, and used for the query implementation.
This restriction is relaxed in the Second Challenge. Instead, the dependence between two workflow processes is inherited from the data: process A is marked as an ancestor of B (and vice versa, B as a successor of A) if there is a data file F that is an output of A and an input of B. Logical filenames are considered for this purpose (the name attribute of the file elements in the format definition below), not physical filenames (the content of the url elements).
For the purpose of the challenge we implement this process in an external "sew" script. The script is seeded with one or more process identifiers; it queries JP recursively, traversing the data dependences (common input-output files) in both directions until the complete graph closure is found. The discovered dependences are recorded with the processes as the ancestor and successor attributes of the First Challenge, so the implementation of the challenge queries remains unchanged in this respect.
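As an illustration, the following minimal sketch (in Python) shows the closure computation. The query callbacks (query_inputs, query_outputs, producers_of, consumers_of) are hypothetical stand-ins for the actual JP queries by logical file name, not part of any real JP API.

# A minimal sketch of the "sew" closure. The four callbacks are
# hypothetical stand-ins for JP queries by logical file name.
def sew(seed_pids, query_inputs, query_outputs, producers_of, consumers_of):
    """Compute ancestor/successor sets for the closure of seed_pids."""
    visited = set()
    frontier = list(seed_pids)
    edges = {}  # pid -> (set of ancestor pids, set of successor pids)

    while frontier:
        pid = frontier.pop()
        if pid in visited:
            continue
        visited.add(pid)
        anc, suc = edges.setdefault(pid, (set(), set()))
        # backwards: the producers of our input files are our ancestors
        for f in query_inputs(pid):
            for p in producers_of(f):
                anc.add(p)
                frontier.append(p)
        # forwards: the consumers of our output files are our successors
        for f in query_outputs(pid):
            for c in consumers_of(f):
                suc.add(c)
                frontier.append(c)
    return edges

Because the traversal proceeds in both directions, every neighbour is eventually visited itself, so the recorded ancestor and successor attributes come out symmetric once the closure is complete.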
Currently the script is invoked on demand. However, it can be turned into a permanent part of the JP infrastructure -- an agent which subscribes for notifications on input/output file assignments to processes and generates the workflow dependencies automatically. The mechanism for generating such notifications is already available in JP; it is used in the communication between the JP Primary Storage and the JP Index Server.
Query implementation
The query implementation remains unchanged from the first challenge, except for the small adaptations described in the following paragraphs.
Executable naming
The First Challenge query scripts used hardcoded executable names. This was not a problem then, as the names matched exactly the values recorded by our implementation of the workflow. However, the naming varies among the teams, e.g. it may or may not contain the absolute path to the executable. Therefore the scripts had to be parametrized to run with the names appropriate for the particular data source.
Timestamps
JP starts gathering data on a job virtually at the same time the job is submitted to the Grid. Therefore, during the First Challenge, we could use the time of job registration with JP to approximate the job run time quite accurately. (Queries on the exact execution time were not implemented in JP at that time.) This is no longer true in the Second Challenge: a job is registered with JP only when the data are imported, i.e. typically much later than its real execution. The query scripts were therefore adjusted to use the true execution time.
Provenance Data for Workflow Parts
Give links here to your provenance data files for the workflow parts of the challenge: three parts for the original workflow and three parts for the modified workflow (as per provenance query 7). The data files could be attached to the results page.
Challenge data format
For the purpose of the Challenge, data are exported from Job Provenance in an XML format conforming to a schema available here. The format is custom-made specifically for the Challenge in order to facilitate data exchange with the other teams; however, it is a full-featured export format from Job Provenance:
- it is generated automatically from the data available in JP after running the First Challenge workflow, without any manual intervention,
- virtually all information in JP is included, even though some of it may not be needed for the Second Challenge,
- the exported files can be taken "as is" for importing back into JP, resulting in equivalent functionality.
An export utility used to generate the exchange files with JP queries is available here.
Commented example
Here we show an example of the data format. The example was hand-edited for better readability.
<?xml version="1.0"?>
<workflow xmlns="http://egee.cesnet.cz/en/Schema/JP/Challenge2">
<exportedStages>1 2</exportedStages>
<job id="https://skurut1.cesnet.cz:9000/yM3sz8v6WCIPgi5-0m8L4w">
<owner>/DC=cz/DC=cesnet-ca/O=Masaryk University/CN=Ales Krenek</owner>
<regtime>2006-07-11T12:22:34</regtime>
<!-- input and output files of this job -->
<inputs>
<file name="urn:challenge:anatomy1.img">
<url>gsiftp://umbar.ics.muni.cz:1414/home/mulac/pch06/anatomy1.img</url>
<url>gsiftp://umbar.ics.muni.cz:1414/home/mulac/pch06/anatomy1.hdr</url>
</file>
</inputs>
<outputs>
<file name="urn:challenge:anatomy1_yM3sz8v6WCIPgi5-0m8L4w.warp">
<url>gsiftp://umbar.ics.muni.cz:1414/home/mulac/pch06/anatomy1_yM3sz8v6WCIPgi5-0m8L4w.warp</url>
</file>
</outputs>
<!-- workflow structure: jobs that precede and follow this one in the workflow -->
<ancestors>
<!-- empty for stage 1 -->
</ancestors>
<successors>
<!-- note the reference to the other job below -->
<jobid>https://skurut1.cesnet.cz:9000/wdWQHL0-RXkd3VeNcSrTaw</jobid>
</successors>
<!-- gLite middleware processing and job execution details -->
<gliteJobRecord>
<!-- omitted for readability -->
</gliteJobRecord>
<!-- user annotations, including Challenge-specific; only the latter are shown -->
<annotations>
<annotation>
<name>http://egee.cesnet.cz/en/WSDL/jp-lbtag:IPAW_STAGE</name>
<value>1</value>
</annotation>
<annotation>
<name>http://egee.cesnet.cz/en/WSDL/jp-lbtag:IPAW_PROGRAM</name>
<value>align_warp</value>
</annotation>
<annotation>
<name>http://egee.cesnet.cz/en/WSDL/jp-lbtag:IPAW_PARAM</name>
<value>-m 12</value>
</annotation>
<annotation>
<name>http://egee.cesnet.cz/en/WSDL/jp-lbtag:IPAW_PARAM</name>
<value>-q</value>
</annotation>
<annotation>
<name>http://egee.cesnet.cz/en/WSDL/jp-lbtag:IPAW_HEADER</name>
<value>global_maximum=4095</value>
</annotation>
</annotations>
</job>
<job id="https://skurut1.cesnet.cz:9000/wdWQHL0-RXkd3VeNcSrTaw">
<!-- another job in the workflow, omitted -->
</job>
<!-- further jobs follow -->
</workflow>
The root element of the file is workflow, corresponding to an entire exported workflow or to one of its parts as given by the Challenge definition. The stages present in the file are listed in exportedStages. The remaining second-level elements are job's, representing the individual processes in the workflow. Each one is assigned a unique ID when first processed by the gLite middleware. Besides general metadata (owner and registration time), the data are organized in the following sections:
Inputs and outputs
The file elements refer to the concrete inputs and outputs of the job. The name attribute is a URI identifying the particular file uniquely. As we did not follow any given file naming scheme in Challenge 1, custom urn:'s are shown in the example; however, any suitable file identifier can be used instead. The input file name of the shown job has no suffix, as it is an input of the entire workflow and only a single set of inputs was given. On the contrary, the output file name contains a unique suffix, indicating that the file was generated by a particular workflow run. As some of the files in the Challenge workflow are in fact collections of files (the .img and .hdr pairs), we use nested url elements (which may occur multiple times) to record the physical file locations as well.
Workflow structure
The structure of the workflow is expressed by links between job's, using their unique identifiers, grouped in the ancestors and successors elements. These links are present in the exported format regardless of whether their targets are exported in the same part of the workflow or not. The links are sufficient to "stitch" separately exported workflow parts together in a unique and reliable way. However, if they are not available explicitly, they can still be reconstructed by matching the inputs and outputs of the jobs.
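For illustration, a Python sketch of such a reconstruction over the exchange format shown above (the file paths in the last line are hypothetical):

# Reconstruct ancestor/successor links from separately exported parts
# by matching logical file names, as described above.
import xml.etree.ElementTree as ET

NS = {"c": "http://egee.cesnet.cz/en/Schema/JP/Challenge2"}

def load_jobs(paths):
    """Collect (job id, input names, output names) from the exported parts."""
    jobs = []
    for path in paths:
        root = ET.parse(path).getroot()
        for job in root.findall("c:job", NS):
            ins = {f.get("name") for f in job.findall("c:inputs/c:file", NS)}
            outs = {f.get("name") for f in job.findall("c:outputs/c:file", NS)}
            jobs.append((job.get("id"), ins, outs))
    return jobs

def stitch(jobs):
    """Yield (ancestor id, successor id) pairs where file names match."""
    for a_id, _, a_outs in jobs:
        for b_id, b_ins, _ in jobs:
            if a_id != b_id and a_outs & b_ins:
                yield (a_id, b_id)

edges = list(stitch(load_jobs(["part1.xml", "part2.xml", "part3.xml"])))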
Job processing details
The gliteJobRecord element contains details on processing the job in the gLite middleware. It conforms to the schema originally defined for the purpose of computing job statistics in the EGEE project. These data are largely irrelevant for the Challenge, therefore they are omitted in this example; however, they are present in the full exported data below. The contained elements are either described within the schema or self-explanatory.
User annotations
JP allows the user to add arbitrary "namespace:name = value" annotations to a job, where "value" can have an arbitrarily complex XML structure. The same "name" can also occur multiple times. The annotations can be added either during job execution (usually via L&B, the gLite service that tracks the job during its active life) or later via the native JP interface. The annotations of particular interest for the Challenge are shown above. They correspond to the tags recorded and described in Challenge 1, with the exception of IPAW_INPUT and IPAW_OUTPUT, which are mapped specifically in this format.
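As an illustration, a query script might read the Challenge annotations of a parsed job element (an ElementTree Element, as in the earlier sketch) like this; the names follow the commented example above:

# Read the Challenge annotations from a parsed job element, e.g. to
# collect the align_warp parameters for query 4.
NS = {"c": "http://egee.cesnet.cz/en/Schema/JP/Challenge2"}
PREFIX = "http://egee.cesnet.cz/en/WSDL/jp-lbtag:"

def challenge_annotations(job):
    """Return (short name, value) pairs, e.g. ("IPAW_PARAM", "-m 12")."""
    pairs = []
    for a in job.findall("c:annotations/c:annotation", NS):
        name = a.findtext("c:name", default="", namespaces=NS)
        value = a.findtext("c:value", default="", namespaces=NS)
        if name.startswith(PREFIX):
            pairs.append((name[len(PREFIX):], value))
    return pairs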
Full workflow data
Original workflow
Modified workflow
Not addressed in this challenge.
Model Integration Results
In order to gain a better understanding of the issues of translation between the provenance data models, we extended the challenge specification into two stages:
- translation and evaluation of homogeneous workflows (i.e. data recorded in one provenance system only),
- evaluation of heterogeneous workflows (combining data from multiple systems, as requested by the original specification).
In both stages the available data were translated, imported into JP, and the challenge queries run. This approach allows us to focus on the issues specific to translating data from each particular system separately, while discussing the issues arising intrinsically from the combinations (not many, actually) independently.
The translation and import process
Translation and eventual combination of the provenance data (see Translation tools below) is done in the following steps (a sketch of steps 3 and 4 follows the explanation below):
1. separate translation of the workflow parts from their native format to our format (as defined above)
2. unification of the input and output file names of the softmean process (part 2) to match the outputs and inputs of parts 1 and 3
3. adjustment of all output filenames with a unique suffix
4. assignment of new unique id's to all the workflow processes
5. import of the adjusted files into JP
6. running the sew script to determine the dependences between the processes
Steps 2-4 are rather artificial and serve the purpose of the challenge only. Unifying the names of the softmean inputs and outputs is necessary to trigger the inheriting of dependences; if all the provenance systems had gathered data on the same workflow execution, the matching filenames in all parts of the workflow would be identical anyway. Similarly, adding a unique suffix to all filenames allows us to run multiple imports on the same input data without having to purge the JP database between attempts; the same holds for assigning new unique id's to the imported processes in step 4. Step 6, as a side effect, produces a graph representation of the imported data. These graphs are shown in the result sections below.
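A minimal Python sketch of steps 3 and 4 on one exported file follows; it assumes the exchange format above, and the suffixing scheme shown here is illustrative rather than the exact one used by our scripts.

# Suffix the names of all files produced within the workflow and assign
# fresh process id's, so that repeated imports do not clash in JP.
import uuid
import xml.etree.ElementTree as ET

NS = "{http://egee.cesnet.cz/en/Schema/JP/Challenge2}"

def adjust(tree, suffix=None):
    suffix = suffix or uuid.uuid4().hex[:8]
    jobs = tree.getroot().findall(NS + "job")
    # names of all files produced within this import
    produced = {f.get("name")
                for job in jobs
                for f in job.findall(NS + "outputs/" + NS + "file")}
    for job in jobs:
        job.set("id", job.get("id") + "-" + suffix)  # step 4: fresh id
        # step 3: rename every occurrence (output and matching input)
        # consistently, so that the sew script still finds the matches
        for f in job.iter(NS + "file"):
            if f.get("name") in produced:
                f.set("name", f.get("name") + "_" + suffix)
    return tree

The jobid references inside ancestors/successors are not rewritten in this sketch; in our procedure they are regenerated anyway by the sew script in step 6.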
Homogeneous workflows
ES3
Import graph
Provenance Query summary:
1. OK, output
2. OK, output
3. OK, output
4. Impossible, missing align_warp parameters
5. Impossible, missing global maximum parameter
6. Impossible, missing align_warp parameters
7. Not addressed in Challenge 2
8. Out of scope of JP
9. Impossible, missing studyModality annotation
TODO:
- what are the additional three processes (coming from stage 3) in the graph?
- upload query #6 results
Karma
Import graph
The graph is more complicated due to duplicated arcs. This is caused by the use of different logical names for the .img and .hdr pairs of files (unlike the CESNET format, which groups them together under a single logical name). Otherwise the graph matches the expectations exactly.
Provenance Query summary:
1. OK, output
2. OK, output
3. OK, output
4. OK, output
5. Impossible, missing global maximum parameter
6. OK, output
7. Not addressed in Challenge 2
8. Out of scope of JP
9. Not implemented; the studyModality annotation is present, so it should be doable
TODO: more comments on Q9
MyGrid
Import graph
The graph contains a number of "producer" nodes (see Translation Details below); a manually adjusted version (with these nodes removed) meets the expectations.
Provenance Query summary:
1. OK, output
2. OK, output
3. OK, output
4. Not implemented; information on align_warp parameters is present but not processed by our translator
5. Impossible; the global maximum parameter may be present in the j.0:global tag, however the name is not unique, so the translator cannot rely on it
6. Not implemented; information on align_warp parameters is present but not processed by our translator
7. Not addressed in Challenge 2
8. Out of scope of JP
9. Not implemented; the studyModality annotation is present, so it should be doable
SDG
Import graph
The graph contains the first row of "producer" jobs, otherwise it matches the expectations.
Provenance Query summary:
1. OK, output
2. OK, output
3. OK, output
4. OK, output
5. OK, output
6. OK, output
7. Not addressed in Challenge 2
8. Out of scope of JP
9. Impossible, missing studyModality annotation
MINDSWAP
Import graph
Provenance Query summary:
1. OK, output
2. OK, output
3. OK, output
4. OK (note the wrong parameters format in MINDSWAP), output
5. Impossible, ipaw_header missing
6. OK, output
7. Not addressed in Challenge 2
8. Out of scope of JP
9. Impossible, missing studyModality annotation
Heterogeneous workflows
Most of the challenge queries are affected by the availability of data in a particular part of the workflow. Therefore, in general, the results of the heterogeneous queries follow the results of the homogeneous queries on the involved provenance systems. In particular:
- Q4, Q6: align_warp parameters, follow the results of workflow part 1
- Q5: global maximum parameter, workflow part 1 again
- Q9: studyModality annotation, part 3
Import graph
Provenance Query summary:
1. OK, output
2. OK, output
3. OK, output
4. OK, output
5. OK, output
6. OK, output
7. Not addressed in Challenge 2
8. Out of scope of JP
9. Impossible, studyModality annotation missing in SDG data
ES3-MyGrid-SDG
Import graph
Provenance Query summary:
1. OK, output
2. OK, output
3. OK, output
4. Impossible, ipaw_param not present in ES3
5. Impossible, ipaw_header not present in ES3
6. Impossible, ipaw_param not present in ES3
7. Not addressed in Challenge 2
8. Out of scope of JP
9. Impossible, studyModality annotation missing in SDG data
Import graph
The graph contains a number of "producer" nodes from MyGrid.
Provenance Query summary:
1. OK, output
2. OK, output
3. OK, output
4. Impossible, ipaw_param not present in MyGrid
5. Impossible, ipaw_header not present in MyGrid
6. Impossible, ipaw_param not present in MyGrid
7. Not addressed in Challenge 2
8. Out of scope of JP
9. Impossible, studyModality annotation missing in SDG data
Karma-SDG2-MINDSWAP2
Import graph
Provenance Query summary:
1. OK, output
2. OK, output
3. OK, output
4. OK, output
5. Impossible, ipaw_header not present in Karma
6. OK, output
7. Not addressed in Challenge 2
8. Out of scope of JP
9. Impossible, studyModality annotation missing in MINDSWAP data
Translation Details
Describe details regarding how data models were translated (or otherwise used to answer the query following the team's approach), any data which was absent from a downloaded model, and whether this affected the possibility of translation or successful provenance query, and any data which was excluded in translation from a downloaded model because it was extraneous
The sections below briefly describe the issues that arose from translating the data of the particular provenance systems and importing them into JP. The list is not complete with respect to all the participating teams: we were not able to put the necessary effort into evaluating all of them, so we chose a more or less random sample, based on a very subjective and brief view of the provided data. Therefore we are not able to provide any serious assessment of the data formats of the systems that are not listed in this section.
Translation tools
For the sake of easy repeatability of the experiments with data translations, we implemented fully automated procedures for translating the data formats and importing the results into JP. This is done for both the homogeneous and the heterogeneous workflows. Our CVS repository is organized as follows:
- export/: the JP export and import utilities, the "sew" script for inheriting the dependences, and common code for the automated translations
- one directory per provenance system: conversion tools for the particular format, and the specific parts of the automated translation and import of homogeneous workflows
- one directory per combination of three provenance systems: the specific code for the translation and import of that particular heterogeneous workflow
JP assigns a job owner (an X509 certificate subject) to each process. There seems to be no analogy in the other formats, therefore we supply the value as a parameter of the translators. Most of the formats also do not include explicit information on the part of the workflow (which matches the notion of stage in our format); this was supplied as an additional parameter of the translators as well.
ES3
- Different logical names are used for the .hdr and .img file pairs (although we understand these files to be tightly coupled). Consequently, duplicate dependences among the workflow processes are detected.
- File names are not consistent across the boundaries of the workflow parts (e.g. the reslice outputs are not the same as the softmean inputs). We believe this to be an artifact of the challenge data rather than a feature of the system, though, and we fixed the problem by manually renaming the files accordingly.
- Arguments of align_warp seem to be defined according to the Challenge 1 example; however, these data are missing in Challenge 2.
- The global maximum parameter and the studyModality annotation are not supported, therefore queries 5 and 9 cannot be run.
MyGrid
- As described at the MyGrid team page, each (workflow) input and output file is represented by its own "pseudoprocess" generating it. This is also true for each file on a workflow part edge. Although we could probably find a sufficiently discriminating criterion to identify such processes automatically (the className of the process: BeanShellProcessor versus StringConstantProcessor), we did not implement it (a sketch of the idea follows this list).
- Both the align_warp parameters and the global maximum are present in the format; however, according to our understanding their naming is ambiguous (the key of a parameter is String Value, and the global maximum seems to be encoded in Ontology:4095). Therefore we could not extract them from the format.
- Physical filenames are not present.
- In general, the file format is rather difficult to understand and parse.
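For illustration only (we did not implement this), the criterion could look roughly as follows; the className values are taken from our reading of the MyGrid data and should be treated as assumptions:

# Sketch of the considered (but unimplemented) filter for MyGrid
# "producer" pseudoprocesses, keyed on the processor class name.
PSEUDO_CLASSES = {"StringConstantProcessor"}  # file-injecting stubs

def is_pseudoprocess(class_name):
    """Return True for processes that merely inject a constant file."""
    return class_name in PSEUDO_CLASSES

processes = [("align_warp1", "BeanShellProcessor"),
             ("anatomy1.img", "StringConstantProcessor")]
real = [name for name, cls in processes if not is_pseudoprocess(cls)]
# real == ["align_warp1"]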
Karma
- global maximum is missing, making query 5 impossible.
- Explicit identifiers of the process instances were missing. We used the concatenation of workflowNodeID and serviceID, believing it to be sufficiently unique.
- stage is missing; we supply its value as a parameter of the translator.
- In general, a well understandable format.
MINDSWAP
- global maximum is missing, making query 5 impossible.
- There is probably a bug in the output/input files between stages 2 and 3: the reslice jobs produce image and header files, but the softmean job imports the headers twice (some in the hasInputImage and some in the hasInputHeader tag) and no images.
- Another small bug is in the parameters of the align_warp jobs -- "-m 12" is stored as "-m -12".
- In general, the file format is rather difficult to understand.
Benchmarks
Describe your proposed benchmark queries, how the comparable quantities are determined, and the results of applying the benchmark to your own system
On Fri, 22 Jun 2007, Simon Miles wrote: "There is nothing particular to prepare for this prior to the workshop, though having thought about possible suitable scenarios or queries that would make suitable benchmarks would be welcome when we come to discuss it."
Further Comments
Provide here further comments.
Conclusions
Provide here your conclusions on the challenge, and issues that you like to see discussed at a face to face meeting.
TODO (ljocha)
-- SimonMiles - 26 Oct 2006
-- AlesKrenek - 19 Feb 2007