Provenance Challenge: Provenance-Aware Storage Systems (PASS)
Participating Team
Team and Project Details
- Short team name: PASS
- Participant names: Uri Braun, David Holland, Peter Macko, Diana MacLean, Daniel Margo, Kiran-Kumar Muniswamy-Reddy, Margo Seltzer, Robin Smogor
- Project URL: http://www.eecs.harvard.edu/syrah/pass
- Project Overview: PASS stands for Provenance-Aware Storage Systems and refers to systems (in our case, file systems) that treat provenance as a first-class object: they collect, maintain, and query it automatically. The second PASS prototype, which we use for this Challenge, is implemented as a set of Linux kernel modules and a file system that automatically capture provenance as users interact with the system as they normally do. Capturing provenance therefore requires no specialized workflow engine or other special-purpose software. PASS captures provenance for any program that runs on Linux 2.6.
- Relevant Publications:
- Muniswamy-Reddy, K., Braun, U., Holland, D., Macko, P., MacLean, D., Margo, D., Seltzer, M., and Smogor, R., Layering in Provenance Systems, Proceedings of the 2009 USENIX Annual Technical Conference, San Diego, CA, June 2009.
- Muniswamy-Reddy, K., Holland, D., Braun, U., Seltzer, M., Provenance-Aware Storage Systems, Proceedings of the 2006 USENIX Annual Technical Conference, Boston, MA, June 2006.
- Holland, D., Braun, U., MacLean, D., Muniswamy-Reddy, K., and Seltzer, M., Choosing a Data Model and Query Language for Provenance, Proceedings of the 2nd International Provenance and Annotation Workshop, Salt Lake City, UT, June 2008.
Workflow Representation
The workflow is represented as a Bash script that executes a modified version of the supplied Java classes. The script mirrors the supplied .bat files, except that some command-line arguments are passed directly instead of as serialized Java objects. For example, we use "--job J062941" instead of "-f JobIDInput.xml". We also modified the Java classes to use our version of SQLite, which tracks provenance at cell-level granularity, instead of Derby.
Open Provenance Model Output
XML-formatted OPM: J062941_v2.opm
We use the following naming conventions:
- Files: full path of the file
- Processes: the name of the executable
- Database cells: table_name row_id:column_name
- Cells in a CSV file: row_id:column_id
There are also several nameless artifacts, which correspond to Unix pipes. The previous version of our XML-formatted OPM export is J062941.opm, in which the command-line arguments are part of the process name instead of separate artifacts.
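As an illustration of the naming conventions for cells, the following Python sketch builds and parses names of the form listed above. It is not part of PASS; the function names are hypothetical, and the parser assumes that row identifiers and column names contain neither spaces nor colons.

```python
def db_cell_name(table, row_id, column):
    """Database cell name: 'table_name row_id:column_name'."""
    return f"{table} {row_id}:{column}"


def csv_cell_name(row_id, column_id):
    """CSV cell name: 'row_id:column_id'."""
    return f"{row_id}:{column_id}"


def parse_db_cell_name(name):
    """Split 'table_name row_id:column_name' back into its three parts."""
    table, _, rest = name.rpartition(" ")
    row_id, _, column = rest.partition(":")
    return table, row_id, column


print(db_cell_name("P2Detection", 8, "peakFlux"))    # P2Detection 8:peakFlux
print(csv_cell_name(7, 30))                          # 7:30
print(parse_db_cell_name("P2Detection 8:peakFlux"))  # ('P2Detection', '8', 'peakFlux')
```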
Query Results
The queries were written in our Path Query Language (PQL) and evaluated on the provenance graph before it was exported to OPM. The version of PQL used for this Challenge uses the following edge-labeling conventions: INPUT is a generic ancestry edge, WHERE denotes a where-provenance edge between two database cells, and CONTAINS represents containment (modeled as an ancestry edge).
Core Query 1
select csv.NAME
from Provenance.% as db, db.CONTAINS as cell, cell.WHERE+.CONTAINS-OF as csv
where db.NAME glob "*/pc3.db"
and cell.TABLE = "P2Detection"
and csv.NAME glob "*.csv";
This query first finds the SQLite database file pc3.db (variable db) and then the set of all cells in table P2Detection (variable cell). The query then looks for all where-ancestors of those cells that originated from a CSV file.
Result:
"/challenge3/PC3/SampleData/J062941/P2_J062941_B001_P2fits0_20081115_P2Detection.csv"
Given a particular entry in the table, we can find where exactly it came from:
select csv.NAME, w.ROW, w.COLUMN
from Provenance.% as db, db.CONTAINS as cell, cell.WHERE+ as w, w.CONTAINS-OF as csv
where db.NAME glob "*/pc3.db"
and cell.TABLE = "P2Detection"
and cell.COLUMN = "peakFlux" and cell.ROW = "8"
and csv.NAME glob "*.csv";
Result:
{
"/challenge3/PC3/SampleData/J062941/P2_J062941_B001_P2fits0_20081115_P2Detection.csv"
"7"
"30"
}
That is, this particular value came from row 7, column 30 of the given .csv file (counting from 0).
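The traversal behind these two queries can be illustrated outside PQL. The Python sketch below is not PASS code: the graph class, the node names, and the intermediate "staging" cell are hypothetical. It assumes that a WHERE edge points from a cell to the cell it was copied from and that a CONTAINS edge points from a container (database or CSV file) to a cell. Under those assumptions, WHERE+ is a transitive closure and CONTAINS-OF is a reverse lookup.

```python
from collections import defaultdict


class ProvGraph:
    """Toy labeled graph; not the PASS storage format."""

    def __init__(self):
        self.succ = defaultdict(set)  # (node, label) -> nodes it points to
        self.pred = defaultdict(set)  # (node, label) -> nodes pointing to it

    def add_edge(self, src, label, dst):
        self.succ[(src, label)].add(dst)
        self.pred[(dst, label)].add(src)

    def closure(self, start, label):
        """Nodes reachable from start over one or more `label` edges."""
        seen = set()
        stack = list(self.succ[(start, label)])
        while stack:
            node = stack.pop()
            if node not in seen:
                seen.add(node)
                stack.extend(self.succ[(node, label)])
        return seen

    def containers_of(self, node):
        """Reverse CONTAINS lookup: which containers hold this node?"""
        return set(self.pred[(node, "CONTAINS")])


def csv_sources(graph, cell):
    """CSV files that contain any where-ancestor of the given cell."""
    return sorted({
        container
        for ancestor in graph.closure(cell, "WHERE")
        for container in graph.containers_of(ancestor)
        if container.endswith(".csv")
    })


g = ProvGraph()
g.add_edge("pc3.db", "CONTAINS", "P2Detection 8:peakFlux")
g.add_edge("detections.csv", "CONTAINS", "7:30")
# Hypothetical two-hop where-provenance through an intermediate value.
g.add_edge("P2Detection 8:peakFlux", "WHERE", "staging 7:30")
g.add_edge("staging 7:30", "WHERE", "7:30")

print(csv_sources(g, "P2Detection 8:peakFlux"))  # ['detections.csv']
```

Keeping both a forward and a reverse index makes the reverse step a dictionary lookup rather than a scan over all edges.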
Core Query 2
select count(X.NAME)
from Provenance.% as X
where X.NAME = "./PSLoadExecutable.sh" and X.TYPE = "PROC"
and X.ARG1 = "IsMatchTableColumnRanges" and X.ARGS glob "*-t P2Detection*";
This query searches for all invocations of IsMatchTableColumnRanges on table P2Detection. If the operation was executed, the count aggregate in the query returns a positive number. If it was not executed, the query result is 0.
Result:
1
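The same filter-and-count logic can be sketched in Python over a list of process records. This is an illustration only: the record layout (dictionaries with NAME, TYPE, and ARGS keys) and the shortened argument strings are assumptions made for the example, not the PASS data model.

```python
from fnmatch import fnmatchcase


def count_invocations(records, operator, table):
    """Count process records that ran `operator` with `-t table`."""
    count = 0
    for rec in records:
        argv = rec["ARGS"].split()
        if (rec["TYPE"] == "PROC"
                and rec["NAME"] == "./PSLoadExecutable.sh"
                and len(argv) > 1 and argv[1] == operator
                and fnmatchcase(rec["ARGS"], f"*-t {table}*")):
            count += 1
    return count


# Hypothetical records with shortened argument strings.
records = [
    {"TYPE": "PROC", "NAME": "./PSLoadExecutable.sh",
     "ARGS": "./PSLoadExecutable.sh IsMatchTableColumnRanges -t P2Detection"},
    {"TYPE": "PROC", "NAME": "./PSLoadExecutable.sh",
     "ARGS": "./PSLoadExecutable.sh LoadCSVFileIntoTable -t P2Detection"},
    {"TYPE": "FILE", "NAME": "/tmp/example.csv", "ARGS": ""},
]

print(count_invocations(records, "IsMatchTableColumnRanges", "P2Detection"))  # 1
print(count_invocations(records, "UpdateComputedColumns", "P2Detection"))     # 0
```

Note that a pattern ending in `*` also matches any longer table name that starts with the same prefix, so an exact comparison of the argument following `-t` would be stricter.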
Core Query 3
select X.ARG4, X.ARG6, X.ARG8, X.ARG10, X.ARG12, X.ARG14
from Provenance.% as db, db.CONTAINS as cell, cell.WHERE*.INPUT as X
where db.NAME glob "*/pc3.db"
and cell.TABLE = "P2Detection"
and cell.COLUMN = "imageID" and cell.ROW = "4"
and X.NAME glob "*/java";
The query identifies all processes that wrote a particular cell (in this example, row 4, column imageID of P2Detection) along with their relevant command-line arguments, but it does not follow the inputs of those processes. The full ancestry of the cell includes all previous processes that used the database, since the SQLite database file is both an input and an output of every previous workflow process. Including these ancestors would pull in operations that were not strictly necessary.
Result:
{
{
"LoadCSVFileIntoTable"
"IsLoadedCSVFileIntoTableOutput_FileEntry2.xml"
"CreateEmptyLoadDBOutput.xml"
"ReadCSVFileColumnNamesOutput_FileEntry2.xml"
"P2Detection"
}
}
Our system does not keep track of control flow that does not result in any data flow, unless we modify /bin/bash to insert custom annotations. For example, there is no data flow from IsCSVReadyFileExists to LoadCSVFileIntoTable, so we do not know whether a successful execution of IsCSVReadyFileExists was strictly necessary for a particular cell to appear in the database.
Optional Query 1
select count(java.ARG10) - 1
from Provenance.% as java
where java.NAME glob "*/java"
and java.ARG4 = "IsMatchTableColumnRanges";
In this query, we just count the number of invocations of IsMatchTableColumnRanges. According to the workflow specification, if we know that the last execution of IsMatchTableColumnRanges failed, the number of correctly loaded tables is just the number of successful invocations of IsMatchTableColumnRanges (which is the total number of invocations minus one).
Result:
2
Optional Query 3
select max(X.FREEZETIME) from Provenance.% as X where X.ARG4 = "IsExistsCSVFile";
select max(X.FREEZETIME) from Provenance.% as X where X.ARG4 = "IsMatchCSVFileTables";
We answer this query by getting the timestamps associated with the last executions of IsExistsCSVFile and IsMatchCSVFileTables, and computing their difference.
Result:
1239499744.640488611 - 1239499740.764267717 = 3.88 seconds
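The subtraction can be reproduced in Python. The timestamps carry nanosecond precision, which a binary floating-point number cannot hold at this magnitude, so this sketch uses decimal arithmetic; the variable names are ours.

```python
from decimal import Decimal

# Timestamps copied from the result above, kept as strings so no digits are lost.
later = Decimal("1239499744.640488611")
earlier = Decimal("1239499740.764267717")

elapsed = later - earlier
print(elapsed)            # 3.876220894
print(round(elapsed, 2))  # 3.88

# A binary float cannot represent all nineteen significant digits exactly.
print(Decimal(float("1239499744.640488611")) == later)  # False
```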
Optional Query 4
select I.NAME
from Provenance.% as db, db.CONTAINS as cell, cell.INPUT+ as I
where db.NAME glob "*/pc3.db"
and cell.TABLE = "P2Detection"
and cell.COLUMN = "peakFlux" and cell.ROW = "8";
This query returns the entire ancestry graph of a particular cell (row 8, column peakFlux of P2Detection), which in PASS is equivalent to the why-provenance.
Result:
(omitted for brevity)
Optional Query 6
First, we find the timestamp of the last execution of a workflow operator.
select max(X.FREEZETIME) from Provenance.% as X where X.ARG0 = "./PSLoadExecutable.sh";
For example, if the result is "1239499744.632488155", we can then query again to find the actual process and its command-line arguments:
select X.ARGS
from Provenance.% as X
where X.ARG0 = "./PSLoadExecutable.sh"
and X.FREEZETIME = "1239499744.632488155";
Result:
"./PSLoadExecutable.sh IsExistsCSVFile -o IsExistsCSVFileOutput_FileEntry2.xml -f FileEntry2.xml"
Optional Query 8
The set of successfully executed steps is just the set of all executed steps minus the failed step. We get the set of all steps using the following query:
select X.ARGS from Provenance.% as X where X.ARG0 = "./PSLoadExecutable.sh";
After subtracting the failed step identified by Optional Query 6, we get the following set (in no particular order):
{
"./PSLoadExecutable.sh CreateEmptyLoadDB -o CreateEmptyLoadDBOutput.xml --job J062941"
"./PSLoadExecutable.sh IsCSVReadyFileExists -o IsCSVReadyFileExistsOutput.xml --path /disk/disk1/challenge3/PC3/SampleData/J062941/"
"./PSLoadExecutable.sh IsExistsCSVFile -o IsExistsCSVFileOutput_FileEntry0.xml -f FileEntry0.xml"
"./PSLoadExecutable.sh IsExistsCSVFile -o IsExistsCSVFileOutput_FileEntry1.xml -f FileEntry1.xml"
"./PSLoadExecutable.sh IsMatchCSVFileColumnNames -o IsMatchCSVFileColumnNamesOutput_FileEntry0.xml
-f ReadCSVFileColumnNamesOutput_FileEntry0.xml"
"./PSLoadExecutable.sh IsMatchCSVFileColumnNames -o IsMatchCSVFileColumnNamesOutput_FileEntry1.xml
-f ReadCSVFileColumnNamesOutput_FileEntry1.xml"
"./PSLoadExecutable.sh IsMatchCSVFileTables -o IsMatchCSVFileTablesOutput.xml -f ReadCSVReadyFileOutput.xml"
"./PSLoadExecutable.sh IsMatchTableColumnRanges -o IsMatchTableColumnRangesOutput_FileEntry0.xml -f CreateEmptyLoadDBOutput.xml
-f ReadCSVFileColumnNamesOutput_FileEntry0.xml -t P2FrameMeta"
"./PSLoadExecutable.sh IsMatchTableColumnRanges -o IsMatchTableColumnRangesOutput_FileEntry1.xml -f CreateEmptyLoadDBOutput.xml
-f ReadCSVFileColumnNamesOutput_FileEntry1.xml -t P2ImageMeta"
"./PSLoadExecutable.sh IsMatchTableRowCount -o IsMatchTableRowCountOutput_FileEntry0.xml -f CreateEmptyLoadDBOutput.xml
-f ReadCSVFileColumnNamesOutput_FileEntry0.xml -t P2FrameMeta"
"./PSLoadExecutable.sh IsMatchTableRowCount -o IsMatchTableRowCountOutput_FileEntry1.xml -f CreateEmptyLoadDBOutput.xml
-f ReadCSVFileColumnNamesOutput_FileEntry1.xml -t P2ImageMeta"
"./PSLoadExecutable.sh LoadCSVFileIntoTable -o IsLoadedCSVFileIntoTableOutput_FileEntry0.xml -f CreateEmptyLoadDBOutput.xml
-f ReadCSVFileColumnNamesOutput_FileEntry0.xml -t P2FrameMeta"
"./PSLoadExecutable.sh LoadCSVFileIntoTable -o IsLoadedCSVFileIntoTableOutput_FileEntry1.xml -f CreateEmptyLoadDBOutput.xml
-f ReadCSVFileColumnNamesOutput_FileEntry1.xml -t P2ImageMeta"
"./PSLoadExecutable.sh ReadCSVFileColumnNames -o ReadCSVFileColumnNamesOutput_FileEntry0.xml -f FileEntry0.xml"
"./PSLoadExecutable.sh ReadCSVFileColumnNames -o ReadCSVFileColumnNamesOutput_FileEntry1.xml -f FileEntry1.xml"
"./PSLoadExecutable.sh ReadCSVReadyFile -o ReadCSVReadyFileOutput.xml --path /disk/disk1/challenge3/PC3/SampleData/J062941/"
"./PSLoadExecutable.sh SplitList -o FileEntry?.xml -f ReadCSVReadyFileOutput.xml"
"./PSLoadExecutable.sh UpdateComputedColumns -o IsUpdatedComputedColumnsOutput_FileEntry0.xml -f CreateEmptyLoadDBOutput.xml
-f ReadCSVFileColumnNamesOutput_FileEntry0.xml -t P2FrameMeta"
"./PSLoadExecutable.sh UpdateComputedColumns -o IsUpdatedComputedColumnsOutput_FileEntry1.xml -f CreateEmptyLoadDBOutput.xml
-f ReadCSVFileColumnNamesOutput_FileEntry1.xml -t P2ImageMeta"
}
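The two-step procedure (find the last executed step, then remove it from the set of all steps) can be sketched in Python as follows. This is an illustration, not PASS code: the record layout is assumed, only three of the steps are shown, and every FREEZETIME except the last one is made up for the example.

```python
from decimal import Decimal


def last_step(records):
    """Record with the largest FREEZETIME, compared numerically."""
    return max(records, key=lambda r: Decimal(r["FREEZETIME"]))


def successful_steps(records):
    """All executed steps except the last one, which is assumed to have failed."""
    failed = last_step(records)["ARGS"]
    return {r["ARGS"] for r in records} - {failed}


records = [
    # Hypothetical timestamp.
    {"FREEZETIME": "1239499738.100000000",
     "ARGS": "./PSLoadExecutable.sh IsExistsCSVFile "
             "-o IsExistsCSVFileOutput_FileEntry0.xml -f FileEntry0.xml"},
    # Hypothetical timestamp.
    {"FREEZETIME": "1239499741.200000000",
     "ARGS": "./PSLoadExecutable.sh IsExistsCSVFile "
             "-o IsExistsCSVFileOutput_FileEntry1.xml -f FileEntry1.xml"},
    # Timestamp and arguments as reported in Optional Query 6.
    {"FREEZETIME": "1239499744.632488155",
     "ARGS": "./PSLoadExecutable.sh IsExistsCSVFile "
             "-o IsExistsCSVFileOutput_FileEntry2.xml -f FileEntry2.xml"},
]

for args in sorted(successful_steps(records)):
    print(args)
```

The sketch relies on the premise stated above that the last executed step is the one that failed; it does not inspect exit codes or output files.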
Optional Query 10
select L.ARGS
from Provenance.% as L
where L.NAME = "./LoadWorkflow.sh" and L.TYPE = "PROC";
This query just returns all command-line arguments to LoadWorkflow.sh.
Result:
"./LoadWorkflow.sh J062941 /disk/disk1/challenge3/PC3/SampleData/J062941/"
Optional Query 11
select L.ARG1
from Provenance.% as L
where (L.ARGS glob "*--path*" or L.ARGS glob "*--job*") and L.NAME = "./PSLoadExecutable.sh";
In our implementation, we pass the two user inputs to other processes via command-line arguments --path and --job, so our query just looks for the processes with at least one of these arguments. Equivalently, given the results of the previous query, we could instead search for all processes with arguments "J062941" or "/disk/disk1/challenge3/PC3/SampleData/J062941/".
Result:
{
"CreateEmptyLoadDB"
"IsCSVReadyFileExists"
"ReadCSVReadyFile"
}
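The equivalence between the two approaches described above (matching on the flags, or matching on the user-supplied values) can be checked with a small Python sketch. The record layout and function names are assumptions made for the example; the argument strings are taken from the list of executed steps in Optional Query 8.

```python
from fnmatch import fnmatchcase

EXECUTABLE = "./PSLoadExecutable.sh"


def operators_by_flag(records):
    """Operators whose arguments contain --path or --job."""
    return sorted({
        r["ARGS"].split()[1] for r in records
        if r["NAME"] == EXECUTABLE
        and (fnmatchcase(r["ARGS"], "*--path*") or fnmatchcase(r["ARGS"], "*--job*"))
    })


def operators_by_value(records, user_inputs):
    """Operators that received one of the user inputs as an argument."""
    return sorted({
        r["ARGS"].split()[1] for r in records
        if r["NAME"] == EXECUTABLE
        and any(value in r["ARGS"].split() for value in user_inputs)
    })


path = "/disk/disk1/challenge3/PC3/SampleData/J062941/"
records = [
    {"NAME": EXECUTABLE,
     "ARGS": "./PSLoadExecutable.sh CreateEmptyLoadDB -o CreateEmptyLoadDBOutput.xml --job J062941"},
    {"NAME": EXECUTABLE,
     "ARGS": "./PSLoadExecutable.sh IsCSVReadyFileExists -o IsCSVReadyFileExistsOutput.xml --path " + path},
    {"NAME": EXECUTABLE,
     "ARGS": "./PSLoadExecutable.sh ReadCSVReadyFile -o ReadCSVReadyFileOutput.xml --path " + path},
    {"NAME": EXECUTABLE,
     "ARGS": "./PSLoadExecutable.sh IsExistsCSVFile -o IsExistsCSVFileOutput_FileEntry0.xml -f FileEntry0.xml"},
]

print(operators_by_flag(records))
# ['CreateEmptyLoadDB', 'IsCSVReadyFileExists', 'ReadCSVReadyFile']
print(operators_by_value(records, ["J062941", path]))
# ['CreateEmptyLoadDB', 'IsCSVReadyFileExists', 'ReadCSVReadyFile']
```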
Suggested Workflow Variants
Suggested Queries
Query 1
A particular detection value seems wrong. However, the workflow, queries, and the CSV files are correct, so it is possible that the error is due to something external to the workflow engine. Which shared libraries were involved in computing a given value in the database?
select lib.NAME
from Provenance.% as db, db.CONTAINS as cell, cell.INPUT+ as lib
where db.NAME glob "*/pc3.db"
and cell.TABLE = "P2Detection"
and cell.COLUMN = "peakFlux" and cell.ROW = "8"
and lib.NAME glob "*.so*";
The query first finds the SQLite database file and then locates the provenance record that corresponds to the given cell. It then searches for all shared libraries within the ancestry of that cell.
Result:
{
"/challenge3/sqlite/java/libsqlite-java.so"
"/etc/ld.so.cache"
"/lib/libacl.so.1.1.0"
"/lib/libattr.so.1.1.0"
"/lib/libblkid.so.1.0"
"/lib/libdevmapper.so.1.02"
"/lib/libnss_mdns4_minimal.so.2"
"/lib/libselinux.so.1"
"/lib/libsepol.so.1"
"/lib/libuuid.so.1.2"
"/lib/tls/i686/cmov/libc-2.3.6.so"
"/lib/tls/i686/cmov/libdl-2.3.6.so"
"/lib/tls/i686/cmov/libm-2.3.6.so"
"/lib/tls/i686/cmov/libnsl-2.3.6.so"
"/lib/tls/i686/cmov/libnss_compat-2.3.6.so"
"/lib/tls/i686/cmov/libnss_dns-2.3.6.so"
"/lib/tls/i686/cmov/libnss_files-2.3.6.so"
"/lib/tls/i686/cmov/libnss_nis-2.3.6.so"
"/lib/tls/i686/cmov/libpthread-2.3.6.so"
"/lib/tls/i686/cmov/libresolv-2.3.6.so"
"/pmacko/pass/tools/challenge3/sqlite/java/libsqlite-java.so"
"/usr/local/java/jdk1.5.0_16/jre/lib/i386/libjava.so"
"/usr/local/java/jdk1.5.0_16/jre/lib/i386/libverify.so"
"/usr/local/java/jdk1.5.0_16/jre/lib/i386/libzip.so"
"/usr/local/java/jdk1.5.0_16/jre/lib/i386/native_threads/libhpi.so"
"/usr/local/java/jdk1.5.0_16/jre/lib/i386/server/libjvm.so"
}
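The shape of this query (transitive ancestry followed by a name filter) can be sketched in Python. This is not PASS code: the adjacency dictionary, the process path, and the CSV name are hypothetical, and INPUT edges are assumed to point from a node to its direct ancestors.

```python
from fnmatch import fnmatchcase


def ancestors(inputs, start):
    """All nodes reachable from start over one or more INPUT edges."""
    seen = set()
    stack = list(inputs.get(start, ()))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(inputs.get(node, ()))
    return seen


def shared_libraries(inputs, cell):
    """Ancestors of the cell whose names match the glob '*.so*'."""
    return sorted(n for n in ancestors(inputs, cell) if fnmatchcase(n, "*.so*"))


# Hypothetical ancestry: node -> its direct INPUT ancestors.
inputs = {
    "P2Detection 8:peakFlux": ["/usr/bin/java"],
    "/usr/bin/java": ["/lib/libselinux.so.1", "/etc/ld.so.cache", "detections.csv"],
}

print(shared_libraries(inputs, "P2Detection 8:peakFlux"))
# ['/etc/ld.so.cache', '/lib/libselinux.so.1']
```

As in the result above, the pattern also matches /etc/ld.so.cache, which is the dynamic linker's cache file rather than a shared library; a stricter pattern would be needed to exclude it.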
Suggestions for Modification of the Open Provenance Model
Conclusions
-- PeterMacko - 22 May 2009