Provenance Challenge 3 Questions
There are two sets of queries: core queries and optional queries. The core queries are those that all teams should try to answer. Optional queries are additional provenance questions that teams may answer. If a team has any new queries, please add these to the optional queries section.
Core Queries
Query 1
For a given detection, which CSV files contributed to it?
Basic sample answer: The CSV file containing the Detection table.
Advanced sample answer: The CSV file containing the Detection table, CSV file containing the Image table (as the image is an attribute of the detection), and CSV file containing the FrameMetadata table (as the frame metadata is an attribute of the image).
Query 1 Detailed
This is a more specific detailing of Query 1 what the query is, what results are to be expected and what evidence exists that provenance was determined.
Query:
For a given detection described in the database and identified by a detection id, which CSV files contributed to that description? By description, we mean the rows in the database related to a detection with a given detection id.
1. Inputs
The inputs to this query are a detection id and a database. A detection id maps to the DETECTID column in the P2DETECTION table inside a given database.
detection id: 261887437030025144
database: J062941_LoadDB
2a. Answer
The file name of the P2Detection CSV file that contained the detection id.
2b. Advanced Answer
- The file name of the P2Detection CSV file containing the detection id,
- the file name of the
ImageMeta? CSV file containing the image id (IMAGEID) associated to the detection.
- the file name of the
FrameMeta? CSV file containing the frame id (FRAME ID) that is associated with the image id that is in turn associated with the detection.
The following are just examples of how queries could be answered via the traversal of a provenance graph. Systems may answer these queries using a different approach.
3a. Evidence of provenance traversal
We would want to see the following:
- the detection with the given id is described by a row, R, in the P2DETECTION table in the database.
- R was caused by modifications of the P2DETECTION table by the UpdateComputedColumn? component.
- the invocation of UpdateComputedColumn? was caused by the existence of a CSVFileEntry?
- Additionally, R was caused by LoadCSVFileIntoTable? component
- the invocation of LoadCSVFileIntoTable? was cause by the existence of a CSVFileEntry?, the same one in step 2.
- The CVSFileEntry? was caused by the existence of a file name
3b. Evidence of provenance traversal
- the detection with the given id is described by a row, R, in the P2DETECTION table in the database.
- row R contains a foreign key IMAGE ID that points to a row, I, in the P2IMAGEMETA table.
- row I in turn contains a foreign key FRAME ID that points to a row, F, in the P2FRAMEMETA table.
- then for each of the rows, R, I, F we would want to see the same evidence as in the answer to 3a.
4a and 4b below are a particular
OPM example that describe evidence of a provenance traversal.
4a. Evidence of provenance traversal
- the detection with the given id is described by a row, artifact R, in the P2DETECTION table in the database.
- R wasGeneratedBy modifications of the P2DETECTION table by the UpdateComputedColumn? process.
- R wasDerivedFrom R^. R wasDerivedFrom a CVSFileEntry? artifact, F
- the invocation of UpdateComputedColumn? Used artifact F and artifact R^
- Additionally, artifact R^ wasGeneratedBy LoadCSVFileIntoTable? process 6. the invocation of LoadCSVFileIntoTable? process Used artifact F
- R^ wasDerivedFrom F
- F wasDerivedFrom a file name artifact
4b. Evidence of provenance traversal
- the detection with the given id is described by a row, artifact R, in the P2DETECTION table in the database.
- R wasDerivedFrom a foreign key IMAGE ID, which wasDerivedFrom a row, artifact I, in the P2IMAGEMETA table.
- row I in turn wasDerivedFrom a foreign key FRAME ID that wasDerivedFrom a row, artifact F, in the P2FRAMEMETA table.
- then for each of the artifacts, R, I, F we would want to see the same evidence as in the answer to 4a.
Query 2
The user considers a table to contain values they do not expect. Was the range check (IsMatchTableColumnRanges) performed for this table?
Sample answer: Yes
Query 3
Which operation executions were strictly necessary for the Image table to contain a particular (non-computed) value?
Sample answer: call of ReadCSVReadyFile, call of CreateEmptyLoadDB, 2nd call of ReadCSVFileColumnNames, 2nd call of LoadCSVFileIntoTable (2nd calls because Image is loaded in the 2nd iteration of the for loop, excluded checks because they do not change anything, excluded UpdatedComputedColumns because it is non-computed, excluded CompactDatabase because it does not affect the value).
Query 4: "Where Provenance" See Query 1
_This entry in the database seems incorrect. Identify the csv file (and location) where the data occurs._
This query is similar to Query 1. So we have removed it as a core query.
Query 5 (Based on new Plotting Workflow)
For a given zoneGroup
the user selects in the plot image in the DisplayPlot activity, find the CSV files that contains that data?
The P2Detection CSV file used to load the P2Detection table on which the query for the plot data was run. E.g. If the user selects
zoneGroup
1108 in the plot for Job ID J062942, the query returns P2_J062942_B001_P2fits0_20081115_P2Detection.csv.
Optional Queries
Optional Query 1
The workflow halts due to failing an IsMatchTableColumnRanges check. How many tables successfully loaded before the workflow halted due to a failed check?
Sample answer: 2
Optional Query 2
Which pairs of procedures in the workflow could be swapped and the same result still be obtained (given the particular data input)?
Sample answer: (I won't enumerate them all, but I think some can be swapped as the checks in particular are not causally dependent, but we cannot swap those inside the loop with those outside).
Optional Query 3
A CSV or header file is deleted during the workflow's execution. How much time expired between a successful IsMatchCSVFileTables test (when the file existed) and an unsuccessful IsExistsCSVFile? test (when the file had been deleted)?
Sample answer: 3ms
For testing the above query, we it may be simplest to edit the workflow to include deletion of the CSV file as a step.
Optional Query 4: "Why Provenance"
Why is this entry in the database?
Optional Query 5
A user executes the workflow many times (say 5 times) over different sets of data (j062941, j062942, ... j062945). He wants to determine, which of the execution halted?
Sample answer: workflow_execution_id along with input_data_set_name (say j062941, j062942)
Optional Query 6
Determine the step where halt occured?
Sample answer: 2nd call of ReadCSVFileColumnNames or 3rd call of LoadCSVFileIntoTable, etc.
Optional Query 7
Determine data and associated granularities of the data being processed, when halt occured?
Sample answer: data_id and associated granularities(column, row, table, file)
Optional Query 8
Which steps were completed successfully before the halt occurred?
Sample answer: steps name along with call (or count) number, if steps were executed multiple times
Optional Query 9
Which steps were not executed because of halt?
Sample answer: steps name along with call (or count) number, that should have been executed in case workflow did not halt
Optional Query 10
For a workflow execution, determine the user inputs?
Sample answer: jobID,
CVSRootPath?
Optional Query 11
For a workflow execution, determine steps that required user inputs?
Sample answer: IsCVSReadyFileExists?,
CreateEmptyLoadDB?
Optional Query 12
For a workflow execution that halted, which files where processed successfully?
Sample answer: P2Detection, P2FrameMeta
Optional Query 13
_For a workflow execution, display the following provenance views: data dependency views, step dependency view?
Sample answer: either display these views in form of a directed graph, or as a pair of dependency relations between data or steps.
Optional Query 14 (Based on new Plotting Workflow)
For a zoneGroup
the user selects in the plot image, find the ImageMeta? rows that correspond to the data used in the zoneGroup
.
Sample answer: If the user selects
zoneGroup
1108 in the plot for Job ID J062942, the query will return the result of
==SELECT distinct i.imageID FROM P2Detection AS d, P2ImageMeta AS i
WHERE d.imageID = i.imageID and CEILING(zoneId/10) = 1108==
Optional Query 15 (Based on new Plotting Workflow)
A user plots the results of both _DetectionsHistogram and
DetectionsHistogramHiQuality for a database and upon comparing the plots, selects a
zoneGroup
. Find all rows in the
P2Detection
table that was present in the result of
DetectionsHistogram? but not in the result of
DetectionsHistogramHiQuality?._
Sample answer: If the user selects
zoneGroup
1108 in the plot for Job ID J062942, the query will return the set of rows that are present in the result of query (A) but not in the result of query (B):
(A)
SELECT * FROM P2Detection WHERE ceiling(zoneId/10) = 1108
(B)
SELECT * FROM P2Detection WHERE raErr < 0.1 and decErr < 0.05 and ceiling(zoneId/10) = 1108
--
LucMoreau - 19 Mar 2009
--
PaulGroth - 02 Apr 2009
--
ManishAnand - 03 Apr 2009
--
YogeshSimmhan - 09 Apr 2009
to top