
Provenance Challenge


Model Discussion: TIME

Our proposition for TIME (consisting of associating time with used and generatedBy edges), to capture creation and read time, is very data centric.

Juliana, rightly, suggested a process view could also be valid (and quite realistic from an implementation viewpoint). In such a view, processes have a start and end time (which may or may not be related to reading/creating artifacts, but is easy to capture at the OS level). Looking at our model, the catalizedBy edge between a process and an actor could be associated with a start and end time for this process. It would allow for multiple accounts and would nicely complete our model.

What do you think?

A consequence of this is that, now that we can attach creation/reading time to artifacts (by means of causal dependencies) and begin/end time to processes (by means of the catalizedBy dependency), it seems that we should avoid associating time with "summary" edges (triggeredBy and derivedFrom). This avoids attaching a weak semantics to this notion of time: e.g., is this the time of the process that finished or the process that started, or is it the time of the read data or the time of the created data?

What do you think?

Luc

-- LucMoreau - 14 Aug 2007


I greatly prefer the current scheme. The process that generated an artifact may have created the artifact long after it began executing and it might still be running, so we may have no information about when the artifact was created or modified. Systems that can't determine exactly when an artifact was read or written can use process times to fill in these values, and we can define the semantics to allow this ("the time interval on a generatedBy edge includes the interval in which the artifact was created").

Patrick Paulson - 14 Aug 2007


The suggestion is not to replace the current model, but to have both notations coexisting! If you want to talk about time for processes, attach it to processes, likewise for artifacts.

Luc

-- LucMoreau - 14 Aug 2007


Yes, sounds great.

Patrick Paulson - 14 Aug 2007


People are always free to annotate with verbs outside our model - I can put dc:creator on an artifact rather than specifying an agent - but is that the provenance model? (I.e. the argument from the last email that we should only be talking about the time of causality in our model.)

I would have no problem with us defining inference rules for common annotations from other schema based on our core model as a convenience, or generally allowing people to add non-causal annotations as additional metadata - that could be very helpful in mapping to workflow and digital library views of the world. But if we're adding them to the model itself, I think we need a good causality/provenance reason to do so.

Jim

-- JimMyers - 14 Aug 2007


> and we can define the semantics to allow this ("the time
> interval on a generatedBy edge includes the interval in which the
> artifact was created").

But we don't have intervals, so we have to define just one sense - the time on the generatedBy edge is >= the completion of the creation of the artifact.

-- JimMyers - 14 Aug 2007


At 03:42 PM 8/14/2007, Luc Moreau wrote:
> Our proposition for TIME (consisting of associating time with used and
> generatedBy edges), to capture creation and read time, is very data
> centric.

Not really - if there are no durations, the used edge is at the process start and the generatedBy edge is at the end. (Our notion is process centric because the data starts getting created when the file is opened and is finished when it is closed, which may be different from when the process claims to have generated it... i.e. there are data and process centric ways of describing the scheme that are equally valid.)

> Juliana, rightly, suggested a process view could also be valid (and
> quite realistic from an implementation viewpoint). In such a view,
> processes have a start and end time (which may or may not be related
> to reading/creating artifacts, but is easy to capture at the OS
> level).
> Looking at our model, the catalizedBy edge between a process and an
> actor could be associated with a start and end time for this
> process. It would allow for multiple accounts and would nicely
> complete our model.

How would catalyzed by be associated with both times? How would it allow for multiple accounts? I may launch a job that gets scheduled for tomorrow - when do I consider the process to have started?

Overall, I don't think that moving time stamps to the agents solves the problem any better than what we have and, in some sense, it would start to model agency as a process as well (the user pushed a button and sent a '1' into the engine to start things - that process and data transfer has a time, whereas a user causing/catalyzing a process is a different thing that doesn't have a time (when did I start causing the process - when I first thought about it?)).

If we start putting times on catalyze relationships, I think we're starting to sneak control flow into the model. As it is, agents are not a direct part of the process and catalyzed means something like depends on but the exact details of the processes and data transfers involved are unknown. If we start saying "but we know when" the catalysis happened, we're starting to be inconsistent in some sense.

> What do you think?
>
>
> A consequence of this, is that: now we can attach creation/reading
> time to artifacts (by means of causal dependencies), and begin/end
> time of processes (by means of catalizedBy dependency), it seems that
> we should avoid associating time to "summary" edges (triggeredBy and
> derivedFrom). This avoid having a weak semantics attached to this
> notion of time: e.g., is this the time of the process that finished or
> the process that started, or is it the time of the read data or the
> time of the created data.

I've argued against time on the catalyzed link already - I think you can get the process start/end time as easily as you get data times from the used/generatedBy links and, as I think we discussed w.r.t. time durations, if you can't live with the inaccuracies introduced by assuming things are instantaneous or coincident, model the process in more detail. I.e. if it is significant that the process started at one time and the data came in later, model the subprocesses and show the reading step happened later in the subprocess graph.

A process starts at or before the used time and the data read is complete at or after that time. A process ends at or after the generatedBy time and the data was produced at or before that time. Decrease the uncertainty by modeling the subprocesses.
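
As a minimal sketch of the bounds just described, assuming each used and generatedBy edge carries a single timestamp: the function name `process_time_bounds` and the numeric timestamps are illustrative assumptions, not part of the model.

```python
from typing import Optional, Sequence, Tuple


def process_time_bounds(
    used_times: Sequence[float],
    generated_times: Sequence[float],
) -> Tuple[Optional[float], Optional[float]]:
    """Return (latest possible start, earliest possible end) for a process.

    The process started at or before each of its used times, so it started
    at or before the earliest one. It ended at or after each of its
    generatedBy times, so it ended at or after the latest one.
    A bound is None when there are no edges of that kind.
    """
    latest_start = min(used_times) if used_times else None
    earliest_end = max(generated_times) if generated_times else None
    return latest_start, earliest_end


# Hypothetical timestamps: inputs read at t=10 and t=12, one output at t=20.
print(process_time_bounds([10, 12], [20]))  # (10, 20)
# No modeled inputs: nothing constrains the start time.
print(process_time_bounds([], [20]))        # (None, 20)
```

These are only inequalities; tightening them requires modeling the subprocesses, as argued above.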

> What do you think?

The model we had and argued for here does have an issue - what happens if there are no modeled inputs or outputs (a random number generator was run - where does the start time go if there's no input, only an agent that starts it?) - do we require a dummy input or output (e.g. require that you model the 'seed' to the number generator) to get a start time? Or do we now have to allow a time on a catalyzed link? I'd still prefer not to have times on catalyzed but don't know how onerous it will be to always have to have an input.

What do you think? :)

Jim

-- JimMyers - 14 Aug 2007

If it's like you suggest, then I misunderstood. What is the meaning of the time associated with a used edge?

The time you read the data or the time you start the process?

If the latter, when a process has two used edges, then it may have two different start times ...

-- LucMoreau - 14 Aug 2007

Good point - any difference in times on two used by edges would be getting to internal details of the process. One can still infer a process start time <= the earliest usedBy time, so the model as I interpreted would still be consistent and understandable, but it does give you detail about the unknown innards of the process.

I think we started trying to put times on edges because we saw the symmetry - you can infer something about process start/end times from data creation times and vice versa. And I think we could come up with issues with reporting time in either place (if the data is created and sits for two years, its creation time doesn't place much restriction on the process start time, and conversely, the start time for a process that starts and polls for data for two years does not tell you much about when the data was read). I think this leads us to either require both creation times for data and start time (or start/end) for processes, or go for the compromise of just talking about the times of the instantaneous used and generatedBy relations and say that if you want more detail, you model the subprocesses. I think either of these choices is still valid...

-- JimMyers - 14 Aug 2007

> You can have multiple accounts with different catalizedBy edges, and therefore associate times.

So catalyzedBy relationships have to be associated with an account as well? I guess this may be required in any case (we agree the document was written, but who wrote it?)

> If you want to distinguish launch/schedule/up/down time, open the box, and make all the relevant components clear.

But the agents aren't in the process box - the Jim agent and the workflow agent both catalyze the process - if we allow multiple/hierarchical agents, it would be better to annotate the process directly with start/end times (and then we're back to the discussion about symmetry above...)

> Again, the point is not to replace what we have, but to complement it.

Do we then constrain the agent start/end times must be before any used/after generatedBy times for an account to be consistent? (Same issue if we just switch to having artifact creation times and process start/end times.)

> > If we start putting times on catalyze relationships, I think we're starting to sneak control flow into the model. As it is, agents are not a direct part of the process and catalyzed means something like depends on but the exact details of the processes and data transfers involved are unknown. If we start saying "but we know when" the catalysis happened, we're starting to be inconsistent in some sense.
> >
>
> My (possibly incorrect) reading of last week's meeting was that "time associated with used-edge means reading time".
> This does not say anything about when process started.

It implies that the process started before the first read, that's all. And then 'if you want more info, model the subprocesses...'

-- JimMyers - 14 Aug 2007

One last comment from another angle. To summarize at this point - just annotating the used/generatedBy relations only puts <= or >= inferences on data creation/read and process start/end times, which is being argued is not good enough since process start/end times are measurable and important (as are data creation/read times). Then we have various issues of where to put annotations on artifacts and processes if there are different accounts and agents, etc.

However, if we're really modeling causality and not workflow (or digital libraries), the only thing that matters is when causation takes place - the process start time has no direct effect on causality. I can see that in practice observers may know the process start time and not the data read time and therefore would have to report a used time equal to the process start time, but that's really an issue for the observer to decide (or to report on subprocesses) - if an observer wants to report process start time they are free to do so as an annotation outside our model, but if they can't do the inference to report a used time, they are not actually reporting on causality yet. (Making the provenance engine infer causality is bad form...)

-- JimMyers - 14 Aug 2007


Aack, what a long thread.

I'm arguing for both 'begin' and 'end' times -- durations -- on both used and generatedBy. (I was thinking we could use time intervals before, but either of these times might be unknown).

Things that are running through my head are artifacts that are streams or sockets -- the artifact might not yet be completely 'generated' but still be in the provenance chain for other artifacts.

Besides that, I agree with Jim's treatment.

Patrick Paulson


A fun case to consider!

What would it mean if B depends on A and A wasn't complete when B was finished? Do we lose something important? Not only does A now have creation start/end dates, but the identifier A now represents a range of things in terms of bytes, violating our "identifiers identify static things" policy.

On the other hand, if B is a stream that really depends on the last ten values of whatever is streaming by in A, do we really want to have to identify all of A's subelements as arguments to a process (and have processes with a growing number of arguments if B was, for example, the running total of the elements of A)? Do we need arrays?

I guess my thought here would be to make the same "model the sub-elements and subprocesses" argument to keep the core model clean and then perhaps ask if we need a syntax to reduce the size of the provenance trace for repetitive processes of certain types (thinking that such a syntax shortcut might not cover all potential use cases, whereas having a stream/array notion in the model would have to be general...)

but - long thread, long day...

Cheers,

Jim

-- JimMyers - 14 Aug 2007


Gosh, all I was thinking was that the 'used' relationship would have a 'begin' and 'end' time on it, as would the 'generatedBy' relationship. Let's say that A is the source document, P is the process, and B is the generated document.

If generatedBy(B,P,startB,endB) and used(P,A,startA,endA) and endB < startA then maybe we can infer not(derivedFrom(A,B)).
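
A minimal sketch of this check, assuming each edge is annotated with optional begin and end times (either may be unknown): the `TimedEdge` class and `rules_out_dependency` function are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TimedEdge:
    """A used or generatedBy edge with optional begin/end times."""
    artifact: str
    process: str
    begin: Optional[float] = None
    end: Optional[float] = None


def rules_out_dependency(generated: TimedEdge, used: TimedEdge) -> bool:
    """True if the process finished generating its output before it began
    using the input, i.e. generated.end < used.begin.

    Returns False when either time is unknown, since nothing can then
    be ruled out.
    """
    if generated.process != used.process:
        raise ValueError("both edges must refer to the same process")
    if generated.end is None or used.begin is None:
        return False
    return generated.end < used.begin


# Hypothetical times: P finished generating B at t=5, began using A at t=7.
gen_b = TimedEdge("B", "P", begin=1, end=5)
use_a = TimedEdge("A", "P", begin=7, end=9)
print(rules_out_dependency(gen_b, use_a))  # True
```

Only the end of generation and the beginning of use enter the comparison; the other two times are carried along but not needed for this inference.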

Patrick Paulson, 14 Aug 2007


An element, which Juliana and I debated today, to add to the discussion:

If causality is the thing we care about, read time of an artifact has no interest. An artifact being timeless (by artifact definition), the time it is read cannot have a causal effect.

Can we argue that other notions of time (artifact creation time, process start time, process end time, whatever other time) have causal implications?

Thoughts? Comments?

-- LucMoreau - 15 Aug 2007


Good point - perhaps read time is redundant in the same manner as the send/receive time on messages in the PASOA model - knowing the claimed read time could serve as a check on the generatedBy time assertions by the previous process... so maybe that makes it like the notions of time you state below - all have relations to the generatedBy time (which I think is pretty clearly linked to causality).

Would it work to just require a generatedBy time and consider all the rest outside the model, but say that a consistent account must obey certain rules (creationtime >= generatedBy >= usedtime >= processStartTime, etc.)? Similarly, we might have a rule that a consistent account that reports a dc:creator annotation of an artifact must show the same entity as an agent, or some such...
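
A rough sketch of such a consistency rule, taking the ordering exactly as written in this post and treating it as data so it can be changed: the `chain_violations` function, the dictionary keys, and the example timestamps are assumptions for illustration only.

```python
from typing import Dict, List, Sequence, Tuple

# Ordering as written in the post, from latest to earliest time.
CHAIN: Tuple[str, ...] = (
    "creationtime", "generatedBy", "usedtime", "processStartTime",
)


def chain_violations(
    times: Dict[str, float],
    chain: Sequence[str] = CHAIN,
) -> List[Tuple[str, str]]:
    """Return the (later, earlier) pairs whose reported times break the chain.

    Names missing from `times` are skipped, so an account that reports
    only some of the annotations is checked on what it does report.
    """
    known = [name for name in chain if name in times]
    return [
        (later, earlier)
        for later, earlier in zip(known, known[1:])
        if times[later] < times[earlier]
    ]


# Hypothetical account that obeys the chain.
ok = {"creationtime": 9, "generatedBy": 8, "usedtime": 5, "processStartTime": 1}
print(chain_violations(ok))   # []

# Hypothetical account whose used time is after its generatedBy time.
bad = {"creationtime": 9, "generatedBy": 8, "usedtime": 10, "processStartTime": 1}
print(chain_violations(bad))  # [('generatedBy', 'usedtime')]
```

Passing a different `chain` lets the same check be reused if the group settles on a different ordering.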

> Can we argue for other notions of time (artifact creation time, process start time, process
> end time, what ever else time) they have causal implications?

Good question - I can't think of an answer here but I can certainly see all of these notions of time as very useful but perhaps not different than other useful types of annotations like dc:creator, mimetype, etc.

Jim

-- JimMyers - 16 Aug 2007


Yes, if time is causing problems, we could leave it out of the model for now, and rely on 'timeless artifacts'--where artifact includes the content.

(So in order to say that the content from a stream was used by a process, the implementation would have to create an artifact that represented both the stream and the times of access -- whatever the implementation would need to later identify what happened...but all of this can be left out of the provenance model)

Patrick Paulson, 16 Aug 2007


Good point - I guess even generationTime is optional in a timeless causality sense. In practice I think that is too extreme, but I can see that there's a sense in which even this time is just an annotation as far as the core model is concerned.

-- JimMyers - 16 Aug 2007


Not sure I understand what this adds and how it addresses streams - echoing your example:
> generatedBy(B,P,timeB) and used(P,A,timeA) and timeB <
> timeA and we can still infer not(derivedFrom(A,B)).

But where does streaming come in - are A and B streams? Are start and end needed somehow because they are streams?

-- JimMyers - 16 Aug 2007


Say A is a stream and B is not. Note that even if the artifacts are timeless, their content is not -- imagine we're streaming video from the web. B can be dependent on A, even if A is not completely generated, P has not terminated, and P has not completed 'using' A. B cannot be dependent on A, however, if P finished generating B before it started using A.

In order to make this inference, we need at least a 'finished generating' time and a 'started using' time.

So in your example - generatedBy(B,P,timeB) and used(P,A,timeA)

If timeB represents 'finished generating' and timeA is 'started using', then we can make the inference -- and for right now I don't see an immediate use for 'started generating' and 'stopped using'.

But ... We are going to want to create the 'used' property before P has stopped using A, because P might be generating artifacts that might be dependent on A. Likewise, we want to know that P is generating B before it has completed generating it, because B might be involved in additional causal chains.

Patrick Paulson, 16 Aug 2007


Darn streams! I worry that this kind of construct allows us to define A as 'the contents of temp.out' which will change as the process runs and have B depend on it. I don't think we want this, but I don't know how we stop it if a streaming A is allowed unless we try to define some sort of rule that whatever part of A B depends on can't change after B is generated...

Jim


In Section 2, artifacts are defined as an "instance" of an object's state. This is unclear - instance of what class/set? Do you mean "instant"? If so, why is it required that it be at an instant? It seems this may be an impractical or inconvenient constraint. An unstated requirement may be that causal relationships are unambiguous: it is always clear what the state of an object was when it caused/was affected by something else. But to meet this requirement, wouldn't a period of the object's existence with no change be acceptable?

Jim:
> Darn streams! I worry that this kind of construct allows us to
> define A as 'the contents of temp.out' which will change as the
> process runs and have B depend on it. I don't think we want this,
> but I don't know how we stop it if a streaming A is allowed unless
> we try to define some sort of rule that whatever part of A B
> depends on can't change after B is generated..

Considering the case of a decomposable workflow, such as "add1ToAll" in Figure 4, one artifact (e.g. 6) is generated part-way through and another (e.g. (3,7)) at the end. We do not treat the workflow as a single process when describing how the first artefact was generated (6 is not apparent on the left-hand figure). If we did we would have a cycle: the process both caused and was caused by the first artefact.

This seems to me to be the same case as with streams. If a process reads a stream and uses data from it at time 1, then later reads the same stream at time 2, receiving different data, then there are in fact two sub-processes. The stream is, in the terminology of the document, an "object" not an "artifact".

-- SimonMiles - 17 Aug 2007



Copyright © 1999-2012 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.