Provenance Challenge: Swift
Participating Team
Team and Project Details
- Short team name: Swift
- Participant names: Ben Clifford (ci.uchicago.edu); Luiz M. R. Gadelha Jr. (cos.ufrj.br)
- Project URL: http://www.ci.uchicago.edu/swift/
- Project Overview: Swift is a system for the rapid and reliable specification, execution, and management of large-scale science and engineering workflows. It supports applications that execute many tasks coupled by disk-resident datasets - as is common, for example, when analyzing large quantities of data or performing parameter studies or ensemble simulations.
- Relevant Publications:
Workflow Representation
There is a directory in the Swift repository at
http://trac.ci.uchicago.edu/swift/browser/SwiftApps/pc3 where the Swift representation of the challenge workflow lives.
Open Provenance Model Output
The latest published
OPM output for the challenge workflow is at
http://www.ci.uchicago.edu/~benc/opm-20090519.xml
Older
OPM output for the challenge workflow is at
http://www.ci.uchicago.edu/~benc/opm-20090428.xml
Older more broken and less informative output:
http://www.ci.uchicago.edu/~benc/opm-20090419.xml corresponding to the version of the workflow in SVN r2724
Query Results
- core query 1: For a given detection, which CSV files contributed to it?
Swift does not record database provenance data, so we used annotations in the application database to record which process inserted or modified each row.
Suppose we want to determine the provenance of the detection that has the identifier 261887481030000003, the first query can be answered in the following manner:
- obtain from the annotations in the application db the process responsible for inserting the detection in the application DB:
select
provenanceid
from
ipaw.p2detectionprov
where
detectid = 261887481030000003;
tag:benc@ci.uchicago.edu,2008:swiftlogs:execute2:pc3-20090507-1008-q4dpcm28:ps_load_executable_db_app-b2bclgaj
- obtain the respective execute_id:
select
execute_id
from
execute2s
where
id = 'tag:benc@ci.uchicago.edu,2008:swiftlogs:execute2:pc3-20090507-1140-z7ebbrz0:ps_load_executable_db_app-8d52pgaj';
tag:benc@ci.uchicago.edu,2008:swiftlogs:execute:pc3-20090507-1140-z7ebbrz0:0-5-5-1-5-1-2-0
- select, among the ancestors (both artifacts and processes) of the execution, the datasets that contain CSV inputs:
select
filename
from
trans, dataset_filenames
where
after='tag:benc@ci.uchicago.edu,2008:swiftlogs:execute:pc3-20090507-1140-z7ebbrz0:0-5-5-1-5-1-2-0' and
before=dataset_id and filename like '%split%';
file://localhost/_concurrent/split_list_output-65fe229c-2da2-4054-997e-fb167b8c30ed--array//elt-3
file://localhost/_concurrent/split_list_output-65fe229c-2da2-4054-997e-fb167b8c30ed--array//elt-2
file://localhost/_concurrent/split_list_output-65fe229c-2da2-4054-997e-fb167b8c30ed--array//elt-1
- the filenames returned contain references to the CSV files that will result in the detection id row insertion:
P2_J062941_B001_P2fits0_20081115_P2Detection.csv,
P2_J062941_B001_P2fits0_20081115_P2ImageMeta.csv,
P2_J062941_B001_P2fits0_20081115_P2FrameMeta.csv
- core query 2: The user considers a table to contain values they do not expect. Was the range check (IsMatchTableColumnRanges?) performed for this table?
q2.sh in the SVN runs this query.
The guts of the query is this SQL:
select
dataset_values.value
from
processes, invocation_procedure_names, dataset_usage, dataset_values
where
type='compound' and
procedure_name='is_match_table_column_ranges' and
dataset_usage.direction='O' and
dataset_usage.param_name='inputcontent' and
processes.id = invocation_procedure_names.execute_id and
dataset_usage.process_id = processes.id and
dataset_usage.dataset_id = dataset_values.dataset_id;
This returns the input parameter XML for all is_match_table_column_ranges calls. These are XML values, and it is necessary to examine the resulting XML to determine if it was invoked for the specific table. There is unpleasant cross-format joining necessary here to get an actual yes/no result properly, although probably could use a LIKE clause to peek inside the value.
- query 3: Which operation executions were strictly necessary for the Image table to contain a particular (non-computed) value?
This uses the additional annotations made for query 1. These annotations only store which process originally inserted a row, not which processes have modified a row. So to some extent, rows are regarded a bit like artifacts (though not first order artifacts in the provenance database); and we can only answer questions about the provenance of rows, not the individual fields within those rows. That is sufficient for this query, though.
First find the row that contains the interesting value and extract its IMAGEID. Then find the process that created the image ID by querying the derby database table P2IMAGEPROV:
ij> select * from ipaw.p2imageprov where imageid=6294301;
IMAGEID |PROVENANCEID
-----------------------------------------------------------------------------------------------------------------------------------------------------
6294301 |tag:benc@ci.uchicago.edu,2008:swiftlogs:execute2:pc3-20090519-2057d8dyi9o9:ps_load_executable_db_app-dpc8q1bj
Now we have a process ID for the process that created the row.
Now query the transitive closure table for all predecessors for that process (as in q1). This will produce all processes and artifacts that preceeded this row creation.
Our answer differs from the sample answer because we have sequenced access to the db, rather than regarding each row as a proper first-order artifact. The entire DB state at a particular time is a successor to all previous database accessing operations, so any process which led to any database access before the row in question is regarded as a necessary operations. This is undesirable in some respects, but desirable in others. For example, a row insert only works because previous database operations which inserted other rows did not insert a conflicting primary key - so there is data dependency between the different operations even though they operate on different rows.
- Optional query 1: The workflow halts due to failing an IsMatchTableColumnRanges? check. How many tables successfully loaded before the workflow halted due to a failed check?
This counts how many load processes are known to the database (over all recorded workflows)
select count(*) from invocation_procedure_names where procedure_name='load_csv_file_into_table';
This can be restricted to a particular workflow run like this:
sqlite> select count(process_id) from invocation_procedure_names,processes_in_workflows where procedure_name='load_csv_file_into_table' and workflow_id='tag:benc@ci.uchicago.edu,2008:swiftlogs:execute:pc3-20090519-1659-jqc5od2f:run' and invocation_procedure_names.execute_id = processes_in_workflows.process_id;
3
- Optional Query 2: Which pairs of procedures in the workflow could be swapped and the same result still be obtained (given the particular data input)?
In our Swift representation of the workflow, we control flow dataflow dependencies. So many of the activities that could be commuted are in our implementation run in parallel. One significant thing can't describe in SwiftScript? (and so cannot answer from the provenance database using this method) is commuting operations on the database. From a Swift perspective, this is a limitation of our SwiftScript? language rather than in the provenance implementation, benc thinks.
The query lists which pairs unix process executions (of which there are 50x50) have no data dependencies on each other.
There are 2082 rows. The base SQL query is this:
select L.id, R.id from processes as L, processes as R where L.type='execute' and R.type='execute' and NOT EXISTS (select
* from trans where before=L.id and after=R.id);
This answer is deficient in a few ways. We do not take into account non-execute procedures (such as compound procedures, function invocations, and operator executions) - there are 253 processes in total, 50 being executes and the remaineder being the other kinds of process. If we did that naively, we would not take into account compound procedures which contain other procedures (due to lack of decent support for nested processes - something like OPM accounts) and would come up with commutations which do not make sense.
Suggested Workflow Variants
Suggested Queries
Suggestions for Modification of the Open Provenance Model
To v1.01.a.xml schema definition, make time be represented by dataTime XSD type. We already do this in our
OPM above, meaning that it should not validate with v1.01.a.xml.
Better support for hierarchies - of collections of artifacts; and of processes. Swift makes heavy use of both in its input language,
SwiftScript?, though this information is not recorded explicitly in the provenance database.
Conclusions
to top