Discussion and Outcomes of the First Provenance Challenge
Discussion 1
After the presentation of each team's results, several discussions were held on how to move forward as a community. A broad consensus developed that there was a need for a second challenge. The current challenge had shown that, for the most part, each team was able to extract the right kinds of information about the challenge workflow to answer the provenance queries that had been set. However, it remained unclear whether the data sets obtained by each team were equivalent.
Terminology
One clear issue was that each team used different terminology to describe its provenance system and model, which obscured whether or not the teams' models were compatible. An idea was put forward for each team to develop a glossary of terms for its system and to try to link its terms to analogous terms from other teams. These glossaries are to be published on the challenge Twiki.
What is a Query
One participant in the workshop discussed the nature of provenance queries, stating that some of the queries in the challenge seemed more about annotations than provenance. This was countered by another participant saying that provenance information can itself be considered a form of annotation, but perhaps one that incorporates the notions of time and causality. It was also pointed out that to obtain true provenance, it was often necessary to incorporate standard annotation information. For example, understanding how a particular data item came about often entails knowing who initiated the process that produced the data item. In this case annotations about that person are also useful, such as his or her name and role, etc.
Interoperability
The discussion then turned to a more detailed understanding of how the teams' systems might be compatible. Even if the teams' systems differed, there might be a route to interoperability if a level of abstraction could be found that all systems adhered to.
Causal Graphs
Luc Moreau raised the possibility that all systems seem to operate around a graph representation of information. All teams agreed that this was true. It was pointed out, however, that this may be because of the nature of the challenge, i.e. that it was based around a workflow and thus this necessitated the graph representation. The question then became whether provenance was always related to workflows. However, at least one of the teams (Harvard) did not explicitly make use of a workflow representation. Indeed the Southampton team considers provenance both inside and outside of workflows and believes it to be important not to confine provenance to workflows. It became clear, however, that more discussion was needed to establish the role played by workflows in computational provenance systems.
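As a minimal sketch of the graph view discussed above, the following Python fragment stores provenance as "was derived from" edges and answers an ancestry query by walking them. The item names and the `build_graph` and `ancestors` helpers are hypothetical and chosen only for illustration; they do not reflect any team's actual model.

```python
from collections import defaultdict

# Hypothetical causal edges as (effect, cause) pairs; all names are made up.
EDGES = [
    ("warped_image", "raw_image"),
    ("warped_image", "reference_image"),
    ("averaged_image", "warped_image"),
    ("final_graphic", "averaged_image"),
]


def build_graph(edges):
    """Map each item to the set of items it was directly derived from."""
    graph = defaultdict(set)
    for effect, cause in edges:
        graph[effect].add(cause)
    return graph


def ancestors(graph, item):
    """Return every item that causally precedes `item` (transitive closure)."""
    seen = set()
    stack = [item]
    while stack:
        current = stack.pop()
        # .get avoids creating empty entries for items with no recorded causes.
        for cause in graph.get(current, ()):
            if cause not in seen:
                seen.add(cause)
                stack.append(cause)
    return seen


if __name__ == "__main__":
    graph = build_graph(EDGES)
    print(sorted(ancestors(graph, "final_graphic")))
```

Nothing in this sketch refers to a workflow definition: the edges could equally be recorded by a system that observes processes directly, which is one reason a graph may be a candidate common abstraction.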
Different levels of explanation
During the presentation of each team's results, it became clear that many different levels of information were being captured amongst the teams. This relates to the notion of levels of explanation. At one extreme there was the PASS system (Harvard), which collected information at an extremely fine level of granularity, whilst many other systems collected information at higher levels. Each approach has its benefits and costs. Lower levels capture more information and potentially offer the ability to answer queries that arise later; however, sifting through the large amount of information is a work-intensive process. Higher levels of information make answering queries easier, but risk losing information that might be useful for future queries. This also impacts upon interoperability, and it is an interesting question whether or not the systems can be connected when they represent and use information at different levels of abstraction.
The interoperability challenge
It was suggested that for the next provenance challenge, teams should pair up to see if they can make their respective systems interoperable. This would involve passing information from one system to another in such a manner that the latter system can understand the information it receives. A simple text-based or XML output from systems was considered the best approach so that data conversion issues could be avoided.
Outcome of Discussion
The outcomes of the discussion are as follows:
- Each team is to produce a glossary of terms with links to analogous terms in other teams' glossaries. This will be done via the challenge Twiki.
- A Twiki page will be set up for participants to engage in a discussion on provenance queries -- whether provenance and metadata are distinct entities or are connected.
- A Twiki page will be set up to allow a space for teams to link up with other teams in order to begin the interoperability challenge. Special attention is to be given to the problem of the different levels of abstraction that teams adopt in their systems, and at which point this should be handled (design time? record time? query time?). Luc: If you know your level of description you can then optimise what you record.
Discussion 2
On the second day of the challenge, another discussion took place. This time the discussion centered around two of the points brought out in the initial discussion:
- Time and Provenance
- Provenance outside of workflows
Time and Provenance
The question was raised: “What role does time play in provenance, and how should it be recorded?” Views differed, with some participants believing that it was just another form of state information and others believing it to be a form of annotation. The point was brought up that provenance information does not necessarily need time information if provenance is considered to be a causal graph. Furthermore, linking different systems and their respective views of time is problematic, since distributed clocks pose many synchronisation problems. Understanding what a time stamp means in a distributed system is difficult. It was discussed how time should be understood as an ontology that all systems could interpret. A need was identified for queries in which time information makes an explicit difference to the answer.
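The point that a causal graph can stand in for time information can be illustrated with a short sketch. Assuming provenance is available as a mapping from each event to its direct causes, an ordering consistent with causality can be derived without any time stamps. The event names and both helper functions are hypothetical; `graphlib` is part of the Python standard library from version 3.9.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: each event lists the events that caused it.
CAUSES = {
    "read_input": set(),
    "transform": {"read_input"},
    "annotate": {"read_input"},
    "publish": {"transform", "annotate"},
}


def causal_order(causes):
    """Return one ordering of events consistent with the causal edges."""
    return list(TopologicalSorter(causes).static_order())


def happened_before(causes, earlier, later):
    """True if `earlier` is a direct or transitive cause of `later`."""
    pending = list(causes.get(later, ()))
    seen = set()
    while pending:
        event = pending.pop()
        if event == earlier:
            return True
        if event not in seen:
            seen.add(event)
            pending.extend(causes.get(event, ()))
    return False


if __name__ == "__main__":
    print(causal_order(CAUSES))
    print(happened_before(CAUSES, "read_input", "publish"))
```

In this example `transform` and `annotate` are causally unrelated, so the graph alone cannot say which came first. A query that needs that answer is the kind for which time information would make an explicit difference.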
Provenance Outside of Workflows
Can the community deal with provenance queries that require information not defined in workflows? This can occur if we consider annotations of data as not part of a workflow; however, this does not deal with the important notion of causal events outside of a workflow and how they can be interpreted or captured by any of the teams’ systems.
A participant noted that workflows are central to our approach because they enable many other forms of information, i.e. annotations, to be associated with them and used by the systems. What is needed is for the community to develop a generalised example that does not rely on a workflow representation of activity.
The outcome of the discussion was inconclusive, with some teams holding to the centrality of workflows and others believing that a definition of provenance should not rely on workflows.
Discussion 3
The final discussion of the workshop sought to pull together the issues of the preceding discussions in order to clarify a way forward. Luc Moreau proposed that an article should be written that would expose the differences and similarities between the teams’ approaches. This would act as a ‘reader’s guide’ to provenance systems. The article should attempt to address five points:
- The purpose of provenance systems
- The specifics of a provenance system
- The model of provenance that is emerging in the community
- The different kinds of systems that exist in the community
- The kinds of queries that provenance can be used to answer
A Twiki page will be set up on which each team can describe its system (this is to be completed by mid November 2006), and a classification of provenance systems will be developed. Feedback to each team will occur before Christmas 2006, and final versions are to be submitted by the end of January 2007.
During this period, each team will also supply their glossaries as discussed in the first discussion.
The Interoperability Challenge
A final discussion relating to the forthcoming Interoperability Challenge also took place.
The challenge would work on the same workflow as the previous challenge, except that each team is to work on only a portion of the workflow (instead of the whole workflow as before), while its partner works on another portion. Each team is to take the provenance information that its partner team derived from the partner's part of the workflow and incorporate it into the provenance information it derives from its own section. This will test the interoperability of each team’s provenance information and the teams' ability to answer queries using provenance information derived from different provenance systems.
It was decided that each team was to export data from its system to its partner team in a text-based format to avoid data conversion problems. The partnering team should then try to import this data into its system and combine it with its own information.
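The export, import, and combine steps described above might look roughly like the following sketch. It assumes provenance is reduced to (effect, cause) edges and uses JSON purely for brevity; the discussion only called for a text-based or XML format, and no exchange format had been fixed. All function names, team names, and item identifiers are hypothetical.

```python
import json


def export_provenance(team, edges):
    """Serialise one team's causal edges to a plain-text (JSON) document."""
    return json.dumps(
        {"team": team, "edges": [list(edge) for edge in edges]},
        sort_keys=True,
    )


def import_provenance(text):
    """Parse an exported document into a team name and a set of edge tuples."""
    doc = json.loads(text)
    return doc["team"], {tuple(edge) for edge in doc["edges"]}


def merge(own_edges, partner_text):
    """Combine local edges with a partner's exported edges into one edge set."""
    _, partner_edges = import_provenance(partner_text)
    return set(own_edges) | partner_edges


if __name__ == "__main__":
    # Team A ran the first portion of the workflow, team B the second.
    exported = export_provenance("team_a", [("intermediate", "input")])
    combined = merge([("output", "intermediate")], exported)
    print(sorted(combined))
```

The merge only joins the two halves if both teams use the same identifier for the shared item (`intermediate` here). Agreeing on such identifiers and on the meaning of an edge is the part the glossaries and the interoperability challenge are meant to address.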
One suggestion is to use the University of Maryland’s provenance ontology developed for the first challenge. The release of this data is to be completed by January 2007.
In June 2007, in Monterey, a discussion will be held regarding progress. The next Provenance Challenge is scheduled for the Autumn/Winter of 2007.
Finally, the management of this challenge is to be jointly held by Luc Moreau, Jim Myers and Mike Wilde, who will work on a new set of queries for the challenge.
--
SteveMunroe - 06 Oct 2006