SwiftPc3 < Challenge

Challenge.SwiftPc3

Provenance Challenge: Swift

Provenance Challenge: Swift

Participating Team

Team and Project Details

Short team name: Swift
Participant names: Ben Clifford (ci.uchicago.edu); Luiz M. R. Gadelha Jr. (cos.ufrj.br)
Project URL: http://www.ci.uchicago.edu/swift/
Project Overview: Swift is a system for the rapid and reliable specification, execution, and management of large-scale science and engineering workflows. It supports applications that execute many tasks coupled by disk-resident datasets - as is common, for example, when analyzing large quantities of data or performing parameter studies or ensemble simulations.
Relevant Publications:

Workflow Representation

There is a directory in the Swift repository at http://trac.ci.uchicago.edu/swift/browser/SwiftApps/pc3 where the Swift representation of the challenge workflow lives.

Open Provenance Model Output

The latest published OPM output for the challenge workflow is at http://www.ci.uchicago.edu/~benc/opm-20090519.xml

Older OPM output for the challenge workflow is at http://www.ci.uchicago.edu/~benc/opm-20090428.xml

Older more broken and less informative output: http://www.ci.uchicago.edu/~benc/opm-20090419.xml corresponding to the version of the workflow in SVN r2724

Query Results

core query 1: For a given detection, which CSV files contributed to it?

Swift does not record database provenance data, so we used annotations in the application database to record which process inserted or modified each row.

Suppose we want to determine the provenance of the detection that has the identifier 261887481030000003, the first query can be answered in the following manner:

obtain from the annotations in the application db the process responsible for inserting the detection in the application DB:

select
  provenanceid
from
  ipaw.p2detectionprov
where
  detectid = 261887481030000003;

tag:benc@ci.uchicago.edu,2008:swiftlogs:execute2:pc3-20090507-1008-q4dpcm28:ps_load_executable_db_app-b2bclgaj

obtain the respective execute_id:

select
  execute_id
from
  execute2s
where 
  id = 'tag:benc@ci.uchicago.edu,2008:swiftlogs:execute2:pc3-20090507-1140-z7ebbrz0:ps_load_executable_db_app-8d52pgaj';

tag:benc@ci.uchicago.edu,2008:swiftlogs:execute:pc3-20090507-1140-z7ebbrz0:0-5-5-1-5-1-2-0

select, among the ancestors (both artifacts and processes) of the execution, the datasets that contain CSV inputs:

select
  filename
from
  trans, dataset_filenames
where
  after='tag:benc@ci.uchicago.edu,2008:swiftlogs:execute:pc3-20090507-1140-z7ebbrz0:0-5-5-1-5-1-2-0' and
  before=dataset_id and filename like '%split%';

file://localhost/_concurrent/split_list_output-65fe229c-2da2-4054-997e-fb167b8c30ed--array//elt-3
file://localhost/_concurrent/split_list_output-65fe229c-2da2-4054-997e-fb167b8c30ed--array//elt-2
file://localhost/_concurrent/split_list_output-65fe229c-2da2-4054-997e-fb167b8c30ed--array//elt-1

the filenames returned contain references to the CSV files that will result in the detection id row insertion:

P2_J062941_B001_P2fits0_20081115_P2Detection.csv,
P2_J062941_B001_P2fits0_20081115_P2ImageMeta.csv,
P2_J062941_B001_P2fits0_20081115_P2FrameMeta.csv

core query 2: The user considers a table to contain values they do not expect. Was the range check (IsMatchTableColumnRanges^?) performed for this table?
q2.sh in the SVN runs this query.
The guts of the query is this SQL:
```
select 
  dataset_values.value
from
  processes, invocation_procedure_names,  dataset_usage, dataset_values
where
  type='compound' and
  procedure_name='is_match_table_column_ranges' and
  dataset_usage.direction='O' and
  dataset_usage.param_name='inputcontent' and
  processes.id = invocation_procedure_names.execute_id and
  dataset_usage.process_id = processes.id and
  dataset_usage.dataset_id = dataset_values.dataset_id;
```
This returns the input parameter XML for all is_match_table_column_ranges calls. These are XML values, and it is necessary to examine the resulting XML to determine if it was invoked for the specific table. There is unpleasant cross-format joining necessary here to get an actual yes/no result properly, although probably could use a LIKE clause to peek inside the value.
query 3: Which operation executions were strictly necessary for the Image table to contain a particular (non-computed) value?
This uses the additional annotations made for query 1. These annotations only store which process originally inserted a row, not which processes have modified a row. So to some extent, rows are regarded a bit like artifacts (though not first order artifacts in the provenance database); and we can only answer questions about the provenance of rows, not the individual fields within those rows. That is sufficient for this query, though.
First find the row that contains the interesting value and extract its IMAGEID. Then find the process that created the image ID by querying the derby database table P2IMAGEPROV:
```
ij> select * from ipaw.p2imageprov where imageid=6294301;
IMAGEID             |PROVENANCEID                                                
-----------------------------------------------------------------------------------------------------------------------------------------------------
6294301             |tag:benc@ci.uchicago.edu,2008:swiftlogs:execute2:pc3-20090519-2057d8dyi9o9:ps_load_executable_db_app-dpc8q1bj
```
Now we have a process ID for the process that created the row.
Now query the transitive closure table for all predecessors for that process (as in q1). This will produce all processes and artifacts that preceeded this row creation.
Our answer differs from the sample answer because we have sequenced access to the db, rather than regarding each row as a proper first-order artifact. The entire DB state at a particular time is a successor to all previous database accessing operations, so any process which led to any database access before the row in question is regarded as a necessary operations. This is undesirable in some respects, but desirable in others. For example, a row insert only works because previous database operations which inserted other rows did not insert a conflicting primary key - so there is data dependency between the different operations even though they operate on different rows.
Optional query 1: The workflow halts due to failing an IsMatchTableColumnRanges^? check. How many tables successfully loaded before the workflow halted due to a failed check?
This counts how many load processes are known to the database (over all recorded workflows)
select count(*) from invocation_procedure_names where procedure_name='load_csv_file_into_table';
This can be restricted to a particular workflow run like this:
sqlite> select count(process_id) from invocation_procedure_names,processes_in_workflows where procedure_name='load_csv_file_into_table' and workflow_id='tag:benc@ci.uchicago.edu,2008:swiftlogs:execute:pc3-20090519-1659-jqc5od2f:run' and invocation_procedure_names.execute_id = processes_in_workflows.process_id; 3
Optional Query 2: Which pairs of procedures in the workflow could be swapped and the same result still be obtained (given the particular data input)?
In our Swift representation of the workflow, we control flow dataflow dependencies. So many of the activities that could be commuted are in our implementation run in parallel. One significant thing can't describe in SwiftScript^? (and so cannot answer from the provenance database using this method) is commuting operations on the database. From a Swift perspective, this is a limitation of our SwiftScript^? language rather than in the provenance implementation, benc thinks.
The query lists which pairs unix process executions (of which there are 50x50) have no data dependencies on each other. There are 2082 rows. The base SQL query is this:
select L.id, R.id from processes as L, processes as R where L.type='execute' and R.type='execute' and NOT EXISTS (select * from trans where before=L.id and after=R.id);
This answer is deficient in a few ways. We do not take into account non-execute procedures (such as compound procedures, function invocations, and operator executions) - there are 253 processes in total, 50 being executes and the remaineder being the other kinds of process. If we did that naively, we would not take into account compound procedures which contain other procedures (due to lack of decent support for nested processes - something like OPM accounts) and would come up with commutations which do not make sense.

Suggested Workflow Variants

Suggested Queries

Suggestions for Modification of the Open Provenance Model

To v1.01.a.xml schema definition, make time be represented by dataTime XSD type. We already do this in our OPM above, meaning that it should not validate with v1.01.a.xml.

Better support for hierarchies - of collections of artifacts; and of processes. Swift makes heavy use of both in its input language, SwiftScript^?, though this information is not recorded explicitly in the provenance database.

Conclusions

to top

End of topic
Skip to action links | Back to top

You are here: Challenge > ThirdProvenanceChallenge > ParticipatingTeams3 > SwiftPc3

to top