Provenance Challenge

Provenance-Aware Storage Systems (PASS)

Participating Team

PASS stands for Provenance-Aware Storage Systems and refers to systems (in our case file systems) that treat provenance as a first-class object, collecting it, maintaining it, and querying it automatically. The first PASS prototype, which we use for this Challenge, is implemented as a set of Linux kernel modules and a file system that automatically capture provenance while users interact with the system as they normally do. Therefore, capturing provenance requires no specialized workflow engines or other special-purpose software. PASS captures provenance for any program that runs on Linux 2.4.

Muniswamy-Reddy, K., Holland, D., Braun, U., Seltzer, M., Provenance-Aware Storage Systems, Proceedings of the 2006 USENIX Annual Technical Conference, Boston, MA, June 2006.

Braun, U., Garfinkel, S., Holland, D., Muniswamy-Reddy, K., Seltzer, M., Issues in Automatic Provenance Collection, Proceedings of the 2006 International Provenance and Annotation Workshop, Chicago, IL, May 2006.

Workflow Representation

As one of our goals is to avoid having to modify applications, we simply ran the provided shell script. (This shell script as downloaded included CRs (\r), which needed to be removed for it to work correctly. The tr process used for this may be seen in the provenance.)

The OS was not installed on a provenance-aware volume; our v1 prototype does not support that.

To reduce extraneous complications, the challenge was run on a fresh PASS volume immediately after rebooting, as an ordinary (non-root) user. The Unix process environment was pruned slightly in the interests of making the query output a little smaller.

We had some concerns about the license of the FSL package, so we used a fake "slicer" tool with the proper outputs compiled into it. Its behavior should be indistinguishable from that of the real slicer, although its own provenance shows that it's fake.

The updated workload for query 7 was run in a subdirectory, on the same original input files, with the pathnames in the workload script updated accordingly.

Provenance Trace

The PASS system does not exactly capture a persistent trace; the raw trace is recorded only in memory (inside the kernel) and what the kernel sends to disk in the provenance database is already a processed form.

Note: these files are rather large, so I've linked them on our site instead of uploading.

Our schema is described in the USENIX paper (referenced above); in quick summary, there are five tables ("databases" in Berkeley DB terminology) as follows:

The separate tables for command/environment strings avoid storing multiple copies of these strings, which often repeat.
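The space saving from keeping strings in their own table can be illustrated with a small interning sketch. This is not the PASS implementation: the class name `StringTable` and the choice to number codes from 0 are assumptions made for the example.

```python
class StringTable:
    """Hypothetical string table: each distinct string is stored once."""

    def __init__(self):
        self._code_by_string = {}
        self._string_by_code = []

    def intern(self, s):
        """Return the code for s, assigning a new one on first sight."""
        code = self._code_by_string.get(s)
        if code is None:
            code = len(self._string_by_code)
            self._code_by_string[s] = code
            self._string_by_code.append(s)
        return code

    def lookup(self, code):
        """Translate a code back into its string."""
        return self._string_by_code[code]


table = StringTable()
first = table.intern("SHELL=/bin/tcsh")
second = table.intern("SHELL=/bin/tcsh")   # repeated by a later process
print(first == second, table.lookup(first))
```

A record then only has to carry the small integer code, which is how the ARGUMENTS and ENVIRONMENT attributes are described as working.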

The version found here is slightly extended from the version described in the USENIX paper, in that in addition to "pnode numbers", which identify versions of files on the provenance-aware volume, it also stores "subobject numbers" - these identify distinct non-file entities that existed in the kernel at runtime. These include processes, files from non-PASS volumes, and pipes. Subobject number 0 refers to the PASS file itself. (For obscure reasons, sometimes subobject 1 does as well; this appears in some of the query results.)

Tracking subobjects allows, more or less, full reconstruction of shell pipelines; it's a compile-time option in our system and can be changed by recompiling the kernel and all the PASS tools.

The best way to think of pnode numbers and subobject numbers, however, is to not worry about the oddities of our internal representation, and just remember that the pair (1.2 in the example below) identifies some type of provenanced entity.

The way to read the pass.db dump is as follows:

1 2 PID -> 1004

means that this record describes pnode 1, subobject 2; the record has attribute "process id" and value 1004.
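As a minimal sketch of reading such a line programmatically, the following splits a record into its four parts. It assumes every line has exactly the shape of the example above (two integers, an attribute name, " -> ", then the value); the names `Record` and `parse_record` are made up for illustration.

```python
from typing import NamedTuple


class Record(NamedTuple):
    pnode: int
    subobject: int
    attribute: str
    value: str


def parse_record(line: str) -> Record:
    """Split one dump line of the form '<pnode> <subobject> <ATTR> -> <value>'."""
    left, value = line.split(" -> ", 1)
    pnode, subobject, attribute = left.split(None, 2)
    return Record(int(pnode), int(subobject), attribute, value)


rec = parse_record("1 2 PID -> 1004")
print(f"{rec.pnode}.{rec.subobject} has {rec.attribute} = {rec.value}")
```

The value is kept as a string because some attributes carry numbers and others carry names.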

These are the attributes in PASS v1:

ANNOTATION attr -> value
user annotation
ARGUMENTS -> num
strings from argv[]; look up the code in the cmdenvrecno dump to find the strings
ENVIRONMENT -> num
Unix process environment strings; ditto
FREEZETIME -> num
time at which this particular entity was "finished"
IFLOW -> num
input crossreference to same pnode and given subobject
INPUT_FILE -> num
input crossreference to given pnode and subobject 0
IPREV -> num
previous version crossreference to same pnode and given subobject
KERNEL -> num
kernel information strings; look up the code in the cmdenvrecno dump to find the strings
MODULE -> num
kernel modules; ditto
NAME -> string
name of file relative to FS root directory
OPEN_NAME -> string
name of non-pass file or pipe ("/|" indicates a pipe)
PID -> num
process id
PREVIOUS_VERSION -> num
previous version crossreference to given pnode and subobject 0
PROC_NAME -> string
name of process from execve()
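The four crossreference attributes differ only in which half of the (pnode, subobject) pair their value supplies. A sketch of that rule, with a hypothetical function name and with record values chosen to mirror the Query 1 example rather than copied from the dump:

```python
# Value is a subobject number; the pnode is that of the record itself.
SAME_PNODE = {"IFLOW", "IPREV"}
# Value is a pnode number; the subobject is 0 (the PASS file itself).
OTHER_PNODE = {"INPUT_FILE", "PREVIOUS_VERSION"}


def resolve_xref(pnode, subobject, attribute, value):
    """Return the (pnode, subobject) pair a crossreference points at, or None.

    subobject is accepted for symmetry with the record layout; none of the
    four crossreference attributes needs it.
    """
    if attribute in SAME_PNODE:
        return (pnode, value)
    if attribute in OTHER_PNODE:
        return (value, 0)
    return None


print(resolve_xref(922, 0, "IFLOW", 2))         # points at 922.2
print(resolve_xref(922, 2, "INPUT_FILE", 919))  # points at 919.0
```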

Provenance Queries

Our project has concentrated primarily on automatic provenance collection -- our goal is to automatically collect useful provenance data and store it in a fashion such that it can be used by many query facilities, perhaps application-specific, all with potentially different data models.

The query tool we are using for the challenge was meant to be simple but comprehensive, and it is functional; but it's also somewhat primitive and we didn't spend time on pretty-printing. Our goal is to demonstrate that the information one wishes to extract from a provenance system is available from a system that does automatic collection.

This query tool is called nq; it accepts a vaguely SQL-like query syntax, which allows recursive searches going either up (ancestors) or down (descendents) the provenance graph, as well as predicate expressions for choosing subsets of the objects found. It treats each distinct version of each provenanced entity thus found as a row, and lets you choose which columns to display.

It does not directly support predicates based on the (non)existence of ancestry relationships, but the same effect can be accomplished by conducting multiple searches with suitable constraints on each.

It can output its reports in HTML instead of text, which offers a small increment in legibility. HTML versions of the two largest results are linked below.

It can also output simple graph descriptions to feed through graphviz; our experience has been that resulting images tend to be prohibitively large... except when the graphs are too large for graphviz to handle at all.

Query 1:

Find the process that led to Atlas X Graphic / everything that caused Atlas X Graphic to be as it is. This should tell us the new brain images from which the averaged atlas was generated, the warping performed etc.

nq 'ancestors atlas-x.gif report'

This means "find all ancestors of the file atlas-x.gif, without limitation, and print the results in report form."
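Conceptually, an unlimited ancestor search is a graph traversal along input edges. The sketch below shows the idea with a breadth-first walk; it is not nq's implementation, and the dictionary layout and function name are assumptions. The edges are a small subset of those in the report excerpt shown for this query.

```python
from collections import deque


def ancestors(inputs, start):
    """Return start plus everything reachable by following input edges.

    inputs maps an identifier to the identifiers it was derived from.
    """
    found = {start: None}      # dict used as an insertion-ordered set
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for parent in inputs.get(node, ()):
            if parent not in found:
                found[parent] = None
                queue.append(parent)
    return list(found)


inputs = {
    "922.0": ["922.2"],            # atlas-x.gif was written by convert
    "922.2": ["922.3", "919.0"],   # convert read libMagick and atlas-x.pgm
}
print(ancestors(inputs, "922.0"))
```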

If you don't have the live filesystem, nq cannot translate the filename to an inode number and thence to a pnode number. PASS was meant for live use, including querying; however, particularly during development, offline queries are useful too. In the case of the databases linked above, the pnode number is 922, and the subobject number is 0; thus you can also write

nq 'ancestors 922.0 report'

The results for this query are rather large (~5M), so I have linked them on our site:

To explain the format and the meaning, I'll paste the first three objects from the result and discuss them:

922.0 [passfile; challenge/atlas-x.gif] version 1
    type: passfile
    name: challenge/atlas-x.gif
    input: 922.2 [proc; pid 2937; /usr/local/bin/convert] version 0
    annotation: dim=x
    annotation: run=base
    annotation: studyModality=mindreading

The heading line gives the pnode and subobject number that unambiguously names this version of this object, as well as the type and the name. PASS files (which are on PASS volumes) are distinguished from non-PASS files (which aren't) because while the system may have some data on the latter, full provenance is not available. The name of a PASS file is, specifically, the path from the root of its volume when it was created. It is not guaranteed to be current, and even if it is, the file presently available under the same name might be a later version.

Note that this object is version 1, and even though versions are numbered starting at 0, you will find no version 0 anywhere in the query result. The query tool folds together versions whose independent existence does not contribute to reporting the data flow into and out of an object. (For example, if an intermediate version has no inputs and outputs of its own, just a previous and next version, the query tool will erase it.) This makes the output much more concise than it otherwise would be. Uninteresting versions can arise when a query takes a subset of the complete provenance graph; they also can arise in the database itself as a result of certain circumstances during execution.
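The folding rule in the parenthetical can be sketched as a filter over an object's version chain. This is a simplification under stated assumptions: the record layout is invented, and the claim that version 0 of atlas-x.gif had no edges of its own is a guess used only to make the example concrete.

```python
def fold_versions(versions):
    """Drop versions that carry no data-flow edges of their own.

    versions is a list of dicts with keys "version", "inputs" and "outputs";
    previous/next-version links are implied by the list order.
    """
    return [v for v in versions if v["inputs"] or v["outputs"]]


atlas_x_gif = [
    {"version": 0, "inputs": [], "outputs": []},          # hypothetical empty version
    {"version": 1, "inputs": ["922.2"], "outputs": []},   # written by convert
]
print([v["version"] for v in fold_versions(atlas_x_gif)])
```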

The other lines give attributes and values. In this case, the file has one input, the process 922.2, which had Unix process id 2937 and was named /usr/local/bin/convert. The three user annotations pertain to query 9; see below.
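For anyone post-processing these reports, the heading line can be picked apart with a regular expression. This is a sketch based only on the heading lines visible on this page; it assumes the bracketed fields are separated by "; " and that names never contain that separator. `parse_heading` is a hypothetical helper, not part of nq.

```python
import re

HEADING = re.compile(r"^(\d+)\.(\d+) \[(.*)\] version (\d+)$")


def parse_heading(line):
    """Parse '<pnode>.<subobject> [<type>; <details>...] version <n>'."""
    match = HEADING.match(line.strip())
    if match is None:
        raise ValueError(f"not a heading line: {line!r}")
    pnode, subobject, inside, version = match.groups()
    kind, *details = inside.split("; ")
    return {
        "pnode": int(pnode),
        "subobject": int(subobject),
        "type": kind,
        "details": details,
        "version": int(version),
    }


print(parse_heading("922.0 [passfile; challenge/atlas-x.gif] version 1"))
```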

The next object (the objects are in this case printed in reverse topological/chronological order, so the oldest and most distant file is at the bottom) is that input, process 922.2:

922.2 [proc; pid 2937; /usr/local/bin/convert] version 0
    type: proc
    pid: 2937
    name: /usr/local/bin/convert
    argv[0]: convert
    argv[1]: atlas-x.pgm
    argv[2]: atlas-x.gif
    input: 922.10 [nonpass; /usr/lib/libfreetype.so.6] version 0
    input: 441.280 [nonpass; /lib/libdl.so.2] version 0
    input: 905.3 [nonpass; /lib/i686/libm.so.6] version 0
    input: 1.3 [nonpass; /lib/i686/libc.so.6] version 0
    input: 361.259 [nonpass; /usr/share/locale/locale.alias] version 0
    input: 441.282 [nonpass; /etc/mtab] version 5
    input: 441.283 [nonpass; /proc/meminfo] version 0
    input: 922.17 [nonpass; /usr/local/lib/ImageMagick-6.2.8/modules-Q16/coders/pnm.la] version 0
    input: 922.18 [nonpass; /usr/local/lib/ImageMagick-6.2.8/modules-Q16/coders/pnm.so] version 0
    input: 922.19 [nonpass; /usr/local/lib/ImageMagick-6.2.8/modules-Q16/coders/gif.la] version 0
    input: 922.20 [nonpass; /usr/local/lib/ImageMagick-6.2.8/modules-Q16/coders/gif.so] version 0
    input: 922.3 [nonpass; /usr/local/lib/libMagick.so.10] version 0
    input: 922.4 [nonpass; /usr/local/lib/libWand.so.10] version 0
    input: 922.5 [nonpass; /usr/lib/libtiff.so.3] version 0
    input: 922.6 [nonpass; /usr/lib/libjpeg.so.62] version 0
    input: 922.7 [nonpass; /usr/lib/libbz2.so.1] version 0
    input: 922.8 [nonpass; /usr/local/lib/libz.so.1] version 0
    input: 922.9 [nonpass; /lib/i686/libpthread.so.0] version 0
    input: 919.0 [passfile; challenge/atlas-x.pgm] version 1
    output: 922.0 [passfile; challenge/atlas-x.gif] version 1
    env: PWD=/pass/fs/challenge
    env: VENDOR=intel
    env: REMOTEHOST=tanaqui.eecs.harvard.edu
    env: HOSTNAME=kastchei
    env: LESSOPEN=|/usr/bin/lesspipe.sh %s
    env: USER=dholland
    env: MACHTYPE=i386
    env: MAIL=/var/spool/mail/dholland
    env: LANG=en_US.iso885915
    env: HOST=kastchei
    env: LOGNAME=dholland
    env: SHLVL=3
    env: GROUP=root
    env: SUPPORTED=en_US.iso885915:en_US:en
    env: SHELL=/bin/tcsh
    env: PRINTER=cork
    env: HOSTTYPE=i386-linux
    env: OSTYPE=linux
    env: HOME=/home/dholland
    env: TERM=xterm
    env: PATH=/usr/local/bin:/usr/X11R6/bin:/usr/bin:/bin
    env: _=/usr/local/bin/convert
    kernel: Linux 2.4.29+autoprov #174 Sun Sep 10 22:32:46 EDT 2006
    module: pasta                  58568   0 (autoclean) (unused)
    module: kbdb                  610048   0 (autoclean) [pasta]
    module: 3c59x                  28680   1
    module: ipchains               47756  15
    module: aic7xxx               152064   4
    module: sd_mod                 12636   8
    module: scsi_mod              103216   2 [aic7xxx sd_mod]

This object has quite a few more attributes, which is not surprising; processes are where the action is. It has a name (taken from execve), a process id, some argument strings, and a whole pile of inputs. We also recorded the complete process environment, the kernel identification string, and the list of loadable modules currently resident. pasta is the PASS file system; kbdb is the kernel-side Berkeley DB we use for storing provenance.

Note that while one of the inputs is the official input from the workload (atlas-x.pgm) the rest are assorted files from the OS or other software packages. These are just as much part of the provenance of the output file as the official workload is.

Note that because the OS is not installed on a PASS volume, these files are not themselves provenanced. If the OS were installed on one, as is planned for PASS v2, you would be able to get their full history all the way back to the OS install.

The third object in the output is one of these non-PASS files, a shared library that's part of ImageMagick:

922.3 [nonpass; /usr/local/lib/libMagick.so.10] version 0
    type: nonpass
    name: /usr/local/lib/libMagick.so.10
    output: 922.2 [proc; pid 2937; /usr/local/bin/convert] version 0

We know it's part of ImageMagick because one can recognize the name; unfortunately, because it's non-PASS, it's not actually provenanced.

If you look through the query results, you'll see the full compile of the AIR tools, which was done on the PASS volume.

Query 2:

Find the process that led to Atlas X Graphic, excluding everything prior to the averaging of images with softmean.

There are at least two ways one could interpret the precise meaning of the point defined by "the averaging of images with softmean"; we chose to cut at the softmean processes themselves. We do this using a concept called an "anchor", which serves as a termination point for a recursive search. The full query:

nq 'ancestors atlas-x.gif anchor (type == "proc" && name == "AIR5.2.5/bin/softmean") report'

which means "find the ancestors of atlas-x.gif, anchoring (not searching any deeper) upon reaching any process whose name matches the softmean executable; display the results in report form."
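The effect of an anchor can be sketched as a traversal that records a matching node but does not expand it. The code below is an illustration, not nq's algorithm; node names are plain strings, and the two resliced image names are hypothetical stand-ins for whatever softmean read.

```python
def anchored_ancestors(inputs, start, is_anchor):
    """Ancestor search that stops descending at nodes matching is_anchor."""
    found = {start: None}      # insertion-ordered set of visited nodes
    stack = [start]
    while stack:
        node = stack.pop()
        if is_anchor(node):
            continue           # keep the anchor itself, ignore what fed it
        for parent in inputs.get(node, ()):
            if parent not in found:
                found[parent] = None
                stack.append(parent)
    return list(found)


inputs = {
    "atlas-x.gif": ["convert"],
    "convert": ["atlas-x.pgm"],
    "atlas-x.pgm": ["slicer"],
    "slicer": ["atlas.img"],
    "atlas.img": ["softmean"],
    "softmean": ["resliced1.img", "resliced2.img"],   # hypothetical names
}
print(anchored_ancestors(inputs, "atlas-x.gif", lambda n: n == "softmean"))
```

The softmean process appears in the result, which matches the choice to cut at the softmean processes themselves, while nothing upstream of it is reported.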

The output for this query is not huge like Q1, but it's still a bit too large for a wiki page, so once again I've linked it:

It is very similar to the first part of the Q1 output. It isn't identical, because objects that are not PASS files are in general named by multiple pnode and subobject number pairs, and which one ends up in the query output depends on circumstances.

The graphic is a representation of the relationships of the objects in the Q2 result. Ellipses are files; boxes are processes. Atlas X Graphic is at the bottom. It has been shrunk to a vaguely reasonable size for a browser; the purpose in this case is to illustrate the general shape, not exhibit details.

Query 3:

Find the Stage 3, 4 and 5 details of the process that led to Atlas X Graphic.

Because our system does not have explicit specification of workloads, there is also no explicit specification of stages. So there is no way to name stages 3, 4, and 5 as such; instead one can search by bounding the query.

This becomes the same as query 2.

(While one could conceivably instead write a query to identify each stage, extract the objects involved, and then merge the lot together into a single report, it would end up being an expensive way to get the same output. Also, nq's query language would have to be extended to allow taking unions of searches.)

Query 4:

Find all invocations of procedure align_warp using a twelfth order nonlinear 1365 parameter model (see model menu describing possible values of parameter "-m 12" of align_warp) that ran on a Monday.

This is a search on just attributes:

nq 'everything where basename == "align_warp" && concat(argv) ~ "*-m 12*" && freezetime ~ "*Mon*" report'

The basename field gives the non-path part of the object name; the tilde operator does shell glob matches. (Not, for now at least, regexps.)

The freezetime field reports when a particular version of an object was "finished". For further details, see the published papers. When matched against a glob, it's converted to a string using the %c format of strftime.

Since in this particular case everything was run on Monday, this query turns up eight align_warp invocations, four from the main workload and four from the variant workload used in query 7.
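The three tests in this predicate can be mimicked outside nq as a rough sketch. It assumes that concat(argv) joins the arguments with single spaces (the nq source is not shown here) and uses Python's fnmatch for the glob match; the timestamp is a made-up Monday, not the recorded freezetime. The "%c" output depends on the time locale, which is the default "C" locale in this sketch.

```python
import fnmatch
import posixpath
from datetime import datetime


def matches_query4(name, argv, freezetime):
    """Mimic: basename == "align_warp" && argv ~ "*-m 12*" && time ~ "*Mon*"."""
    return (
        posixpath.basename(name) == "align_warp"
        and fnmatch.fnmatchcase(" ".join(argv), "*-m 12*")
        and fnmatch.fnmatchcase(freezetime.strftime("%c"), "*Mon*")
    )


argv = ["AIR5.2.5/bin/align_warp", "anatomy1.img", "reference.img",
        "warp1.warp", "-m", "12", "-q"]
monday = datetime(2006, 9, 11, 12, 0, 0)   # hypothetical timestamp
print(matches_query4("AIR5.2.5/bin/align_warp", argv, monday))
```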

Results:


931.2 [proc; pid 2951; ../AIR5.2.5/bin/align_warp] version 0
    type: proc
    pid: 2951
    name: ../AIR5.2.5/bin/align_warp
    argv[0]: ../AIR5.2.5/bin/align_warp
    argv[1]: ../anatomy4.img
    argv[2]: ../reference.img
    argv[3]: warp4.warp
    argv[4]: -m
    argv[5]: 12
    argv[6]: -q
    env: PWD=/pass/fs/challenge/q7
    env: VENDOR=intel
    env: REMOTEHOST=tanaqui.eecs.harvard.edu
    env: HOSTNAME=kastchei
    env: LESSOPEN=|/usr/bin/lesspipe.sh %s
    env: USER=dholland
    env: MACHTYPE=i386
    env: MAIL=/var/spool/mail/dholland
    env: LANG=en_US.iso885915
    env: HOST=kastchei
    env: LOGNAME=dholland
    env: SHLVL=3
    env: GROUP=root
    env: SUPPORTED=en_US.iso885915:en_US:en
    env: SHELL=/bin/tcsh
    env: PRINTER=cork
    env: HOSTTYPE=i386-linux
    env: OSTYPE=linux
    env: HOME=/home/dholland
    env: TERM=xterm
    env: PATH=/usr/local/bin:/usr/X11R6/bin:/usr/bin:/bin
    env: _=../AIR5.2.5/bin/align_warp
    kernel: Linux 2.4.29+autoprov #174 Sun Sep 10 22:32:46 EDT 2006
    module: pasta                  58568   0 (autoclean) (unused)
    module: kbdb                  610048   0 (autoclean) [pasta]
    module: 3c59x                  28680   1
    module: ipchains               47756  15
    module: aic7xxx               152064   4
    module: sd_mod                 12636   8
    module: scsi_mod              103216   2 [aic7xxx sd_mod]

930.2 [proc; pid 2950; ../AIR5.2.5/bin/align_warp] version 0
    type: proc
    pid: 2950
    name: ../AIR5.2.5/bin/align_warp
    argv[0]: ../AIR5.2.5/bin/align_warp
    argv[1]: ../anatomy3.img
    argv[2]: ../reference.img
    argv[3]: warp3.warp
    argv[4]: -m
    argv[5]: 12
    argv[6]: -q
    env: PWD=/pass/fs/challenge/q7
    env: VENDOR=intel
    env: REMOTEHOST=tanaqui.eecs.harvard.edu
    env: HOSTNAME=kastchei
    env: LESSOPEN=|/usr/bin/lesspipe.sh %s
    env: USER=dholland
    env: MACHTYPE=i386
    env: MAIL=/var/spool/mail/dholland
    env: LANG=en_US.iso885915
    env: HOST=kastchei
    env: LOGNAME=dholland
    env: SHLVL=3
    env: GROUP=root
    env: SUPPORTED=en_US.iso885915:en_US:en
    env: SHELL=/bin/tcsh
    env: PRINTER=cork
    env: HOSTTYPE=i386-linux
    env: OSTYPE=linux
    env: HOME=/home/dholland
    env: TERM=xterm
    env: PATH=/usr/local/bin:/usr/X11R6/bin:/usr/bin:/bin
    env: _=../AIR5.2.5/bin/align_warp
    kernel: Linux 2.4.29+autoprov #174 Sun Sep 10 22:32:46 EDT 2006
    module: pasta                  58568   0 (autoclean) (unused)
    module: kbdb                  610048   0 (autoclean) [pasta]
    module: 3c59x                  28680   1
    module: ipchains               47756  15
    module: aic7xxx               152064   4
    module: sd_mod                 12636   8
    module: scsi_mod              103216   2 [aic7xxx sd_mod]

929.2 [proc; pid 2949; ../AIR5.2.5/bin/align_warp] version 0
    type: proc
    pid: 2949
    name: ../AIR5.2.5/bin/align_warp
    argv[0]: ../AIR5.2.5/bin/align_warp
    argv[1]: ../anatomy2.img
    argv[2]: ../reference.img
    argv[3]: warp2.warp
    argv[4]: -m
    argv[5]: 12
    argv[6]: -q
    env: PWD=/pass/fs/challenge/q7
    env: VENDOR=intel
    env: REMOTEHOST=tanaqui.eecs.harvard.edu
    env: HOSTNAME=kastchei
    env: LESSOPEN=|/usr/bin/lesspipe.sh %s
    env: USER=dholland
    env: MACHTYPE=i386
    env: MAIL=/var/spool/mail/dholland
    env: LANG=en_US.iso885915
    env: HOST=kastchei
    env: LOGNAME=dholland
    env: SHLVL=3
    env: GROUP=root
    env: SUPPORTED=en_US.iso885915:en_US:en
    env: SHELL=/bin/tcsh
    env: PRINTER=cork
    env: HOSTTYPE=i386-linux
    env: OSTYPE=linux
    env: HOME=/home/dholland
    env: TERM=xterm
    env: PATH=/usr/local/bin:/usr/X11R6/bin:/usr/bin:/bin
    env: _=../AIR5.2.5/bin/align_warp
    kernel: Linux 2.4.29+autoprov #174 Sun Sep 10 22:32:46 EDT 2006
    module: pasta                  58568   0 (autoclean) (unused)
    module: kbdb                  610048   0 (autoclean) [pasta]
    module: 3c59x                  28680   1
    module: ipchains               47756  15
    module: aic7xxx               152064   4
    module: sd_mod                 12636   8
    module: scsi_mod              103216   2 [aic7xxx sd_mod]

928.2 [proc; pid 2948; ../AIR5.2.5/bin/align_warp] version 0
    type: proc
    pid: 2948
    name: ../AIR5.2.5/bin/align_warp
    argv[0]: ../AIR5.2.5/bin/align_warp
    argv[1]: ../anatomy1.img
    argv[2]: ../reference.img
    argv[3]: warp1.warp
    argv[4]: -m
    argv[5]: 12
    argv[6]: -q
    env: PWD=/pass/fs/challenge/q7
    env: VENDOR=intel
    env: REMOTEHOST=tanaqui.eecs.harvard.edu
    env: HOSTNAME=kastchei
    env: LESSOPEN=|/usr/bin/lesspipe.sh %s
    env: USER=dholland
    env: MACHTYPE=i386
    env: MAIL=/var/spool/mail/dholland
    env: LANG=en_US.iso885915
    env: HOST=kastchei
    env: LOGNAME=dholland
    env: SHLVL=3
    env: GROUP=root
    env: SUPPORTED=en_US.iso885915:en_US:en
    env: SHELL=/bin/tcsh
    env: PRINTER=cork
    env: HOSTTYPE=i386-linux
    env: OSTYPE=linux
    env: HOME=/home/dholland
    env: TERM=xterm
    env: PATH=/usr/local/bin:/usr/X11R6/bin:/usr/bin:/bin
    env: _=../AIR5.2.5/bin/align_warp
    kernel: Linux 2.4.29+autoprov #174 Sun Sep 10 22:32:46 EDT 2006
    module: pasta                  58568   0 (autoclean) (unused)
    module: kbdb                  610048   0 (autoclean) [pasta]
    module: 3c59x                  28680   1
    module: ipchains               47756  15
    module: aic7xxx               152064   4
    module: sd_mod                 12636   8
    module: scsi_mod              103216   2 [aic7xxx sd_mod]

908.2 [proc; pid 2928; AIR5.2.5/bin/align_warp] version 0
    type: proc
    pid: 2928
    name: AIR5.2.5/bin/align_warp
    argv[0]: AIR5.2.5/bin/align_warp
    argv[1]: anatomy4.img
    argv[2]: reference.img
    argv[3]: warp4.warp
    argv[4]: -m
    argv[5]: 12
    argv[6]: -q
    env: PWD=/pass/fs/challenge
    env: VENDOR=intel
    env: REMOTEHOST=tanaqui.eecs.harvard.edu
    env: HOSTNAME=kastchei
    env: LESSOPEN=|/usr/bin/lesspipe.sh %s
    env: USER=dholland
    env: MACHTYPE=i386
    env: MAIL=/var/spool/mail/dholland
    env: LANG=en_US.iso885915
    env: HOST=kastchei
    env: LOGNAME=dholland
    env: SHLVL=3
    env: GROUP=root
    env: SUPPORTED=en_US.iso885915:en_US:en
    env: SHELL=/bin/tcsh
    env: PRINTER=cork
    env: HOSTTYPE=i386-linux
    env: OSTYPE=linux
    env: HOME=/home/dholland
    env: TERM=xterm
    env: PATH=/usr/local/bin:/usr/X11R6/bin:/usr/bin:/bin
    env: _=AIR5.2.5/bin/align_warp
    kernel: Linux 2.4.29+autoprov #174 Sun Sep 10 22:32:46 EDT 2006
    module: pasta                  58568   0 (autoclean) (unused)
    module: kbdb                  610048   0 (autoclean) [pasta]
    module: 3c59x                  28680   1
    module: ipchains               47756  15
    module: aic7xxx               152064   4
    module: sd_mod                 12636   8
    module: scsi_mod              103216   2 [aic7xxx sd_mod]

907.2 [proc; pid 2927; AIR5.2.5/bin/align_warp] version 0
    type: proc
    pid: 2927
    name: AIR5.2.5/bin/align_warp
    argv[0]: AIR5.2.5/bin/align_warp
    argv[1]: anatomy3.img
    argv[2]: reference.img
    argv[3]: warp3.warp
    argv[4]: -m
    argv[5]: 12
    argv[6]: -q
    env: PWD=/pass/fs/challenge
    env: VENDOR=intel
    env: REMOTEHOST=tanaqui.eecs.harvard.edu
    env: HOSTNAME=kastchei
    env: LESSOPEN=|/usr/bin/lesspipe.sh %s
    env: USER=dholland
    env: MACHTYPE=i386
    env: MAIL=/var/spool/mail/dholland
    env: LANG=en_US.iso885915
    env: HOST=kastchei
    env: LOGNAME=dholland
    env: SHLVL=3
    env: GROUP=root
    env: SUPPORTED=en_US.iso885915:en_US:en
    env: SHELL=/bin/tcsh
    env: PRINTER=cork
    env: HOSTTYPE=i386-linux
    env: OSTYPE=linux
    env: HOME=/home/dholland
    env: TERM=xterm
    env: PATH=/usr/local/bin:/usr/X11R6/bin:/usr/bin:/bin
    env: _=AIR5.2.5/bin/align_warp
    kernel: Linux 2.4.29+autoprov #174 Sun Sep 10 22:32:46 EDT 2006
    module: pasta                  58568   0 (autoclean) (unused)
    module: kbdb                  610048   0 (autoclean) [pasta]
    module: 3c59x                  28680   1
    module: ipchains               47756  15
    module: aic7xxx               152064   4
    module: sd_mod                 12636   8
    module: scsi_mod              103216   2 [aic7xxx sd_mod]

906.2 [proc; pid 2926; AIR5.2.5/bin/align_warp] version 0
    type: proc
    pid: 2926
    name: AIR5.2.5/bin/align_warp
    argv[0]: AIR5.2.5/bin/align_warp
    argv[1]: anatomy2.img
    argv[2]: reference.img
    argv[3]: warp2.warp
    argv[4]: -m
    argv[5]: 12
    argv[6]: -q
    env: PWD=/pass/fs/challenge
    env: VENDOR=intel
    env: REMOTEHOST=tanaqui.eecs.harvard.edu
    env: HOSTNAME=kastchei
    env: LESSOPEN=|/usr/bin/lesspipe.sh %s
    env: USER=dholland
    env: MACHTYPE=i386
    env: MAIL=/var/spool/mail/dholland
    env: LANG=en_US.iso885915
    env: HOST=kastchei
    env: LOGNAME=dholland
    env: SHLVL=3
    env: GROUP=root
    env: SUPPORTED=en_US.iso885915:en_US:en
    env: SHELL=/bin/tcsh
    env: PRINTER=cork
    env: HOSTTYPE=i386-linux
    env: OSTYPE=linux
    env: HOME=/home/dholland
    env: TERM=xterm
    env: PATH=/usr/local/bin:/usr/X11R6/bin:/usr/bin:/bin
    env: _=AIR5.2.5/bin/align_warp
    kernel: Linux 2.4.29+autoprov #174 Sun Sep 10 22:32:46 EDT 2006
    module: pasta                  58568   0 (autoclean) (unused)
    module: kbdb                  610048   0 (autoclean) [pasta]
    module: 3c59x                  28680   1
    module: ipchains               47756  15
    module: aic7xxx               152064   4
    module: sd_mod                 12636   8
    module: scsi_mod              103216   2 [aic7xxx sd_mod]

905.2 [proc; pid 2925; AIR5.2.5/bin/align_warp] version 0
    type: proc
    pid: 2925
    name: AIR5.2.5/bin/align_warp
    argv[0]: AIR5.2.5/bin/align_warp
    argv[1]: anatomy1.img
    argv[2]: reference.img
    argv[3]: warp1.warp
    argv[4]: -m
    argv[5]: 12
    argv[6]: -q
    env: PWD=/pass/fs/challenge
    env: VENDOR=intel
    env: REMOTEHOST=tanaqui.eecs.harvard.edu
    env: HOSTNAME=kastchei
    env: LESSOPEN=|/usr/bin/lesspipe.sh %s
    env: USER=dholland
    env: MACHTYPE=i386
    env: MAIL=/var/spool/mail/dholland
    env: LANG=en_US.iso885915
    env: HOST=kastchei
    env: LOGNAME=dholland
    env: SHLVL=3
    env: GROUP=root
    env: SUPPORTED=en_US.iso885915:en_US:en
    env: SHELL=/bin/tcsh
    env: PRINTER=cork
    env: HOSTTYPE=i386-linux
    env: OSTYPE=linux
    env: HOME=/home/dholland
    env: TERM=xterm
    env: PATH=/usr/local/bin:/usr/X11R6/bin:/usr/bin:/bin
    env: _=AIR5.2.5/bin/align_warp
    kernel: Linux 2.4.29+autoprov #174 Sun Sep 10 22:32:46 EDT 2006
    module: pasta                  58568   0 (autoclean) (unused)
    module: kbdb                  610048   0 (autoclean) [pasta]
    module: 3c59x                  28680   1
    module: ipchains               47756  15
    module: aic7xxx               152064   4
    module: sd_mod                 12636   8
    module: scsi_mod              103216   2 [aic7xxx sd_mod]

Query 5:

Find all Atlas Graphic images outputted from workflows where at least one of the input Anatomy Headers had an entry global maximum=4095. The contents of a header file can be extracted as text using the scanheader AIR utility.

PASS v1 was never intended to be able to interoperate with application-specific provenance systems, at least not directly. All the same, it's possible to answer the question, by writing the query in two stages, where the first gives a list of files to interrogate with scanheader and the second generates the desired report given the subset of those files that matched.

Another way to do it is to run scanheader on everything and feed those annotations into PASS; then one can query on the annotations.

The approach we took was the first:

ALIGN_WARPS=`$NQ $NQOPTS '
    select ident from everything
    where type == "proc" && basename == "align_warp"
    table'`
$NQ $NQOPTS '
    select name from ancestors { '"$ALIGN_WARPS"' } depth 1 
    where basename ~ "*.hdr"
    table'

gives a list of .hdr files used as input to align_warp (as opposed to every .hdr file that exists) which can then be tested with scanheader.

One can then hand the matching ones (which is all of them) to this query:

nq 'descendents { anatomy1.hdr anatomy2.hdr anatomy3.hdr anatomy4.hdr reference.hdr }
    where basename ~ "atlas*.gif" || basename ~ "atlas*.jpg"
    report'

giving the results

947.0 [passfile; challenge/q7/atlas-z.jpg] version 1
    type: passfile
    name: challenge/q7/atlas-z.jpg
    annotation: dim=z
    annotation: run=q7
    annotation: studyModality=visual

946.0 [passfile; challenge/q7/atlas-y.jpg] version 1
    type: passfile
    name: challenge/q7/atlas-y.jpg
    annotation: dim=y
    annotation: run=q7
    annotation: studyModality=visual

945.0 [passfile; challenge/q7/atlas-x.jpg] version 1
    type: passfile
    name: challenge/q7/atlas-x.jpg
    annotation: dim=x
    annotation: run=q7
    annotation: studyModality=visual

924.0 [passfile; challenge/atlas-z.gif] version 1
    type: passfile
    name: challenge/atlas-z.gif
    annotation: dim=z
    annotation: run=base
    annotation: studyModality=mindreading

923.0 [passfile; challenge/atlas-y.gif] version 1
    type: passfile
    name: challenge/atlas-y.gif
    annotation: dim=y
    annotation: run=base
    annotation: studyModality=mindreading

922.0 [passfile; challenge/atlas-x.gif] version 1
    type: passfile
    name: challenge/atlas-x.gif
    annotation: dim=x
    annotation: run=base
    annotation: studyModality=mindreading

Query 6:

Find all output averaged images of softmean (average) procedures, where the warped images taken as input were align_warped using a twelfth order nonlinear 1365 parameter model, i.e. "where softmean was preceded in the workflow, directly or indirectly, by an align_warp procedure with argument -m 12."

As mentioned above, nq cannot express this condition directly, but we can run the query in stages using the shell, as follows:

ALIGN_WARPS=`nq '
    select ident from everything 
    where type == "proc" && basename == "align_warp" && concat(argv) ~ "*-m 12*"
    table'`
SOFTMEANS=`nq '
    select ident from descendents { '"$ALIGN_WARPS"' }
    where type == "proc" && basename == "softmean"
    table'`
nq 'select name from descendents { '"$SOFTMEANS"' } depth 1 
     where type == "passfile" && basename ~ "*.img"
     report'

This first finds the align_warp processes, uses them to find the suitable softmean processes, and then uses those to generate the names of the output, using a depth limit on the recursive search.

The "table" output format prints the selected output fields in columns; when you select one field, you get a list. A list of "ident" in this form (ident is the pnode and subobject number) can be fed back into the next query.

A recursive search that starts from multiple objects (in braces) generates a single graph and thus a list of output objects without creating duplicates.
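A sketch of why a multi-root search yields no duplicates: all roots share one visited set, so a node reachable from several roots is recorded once. The depth limit is modelled as a cap on the number of hops. This illustrates the behaviour described above rather than nq's code; the node names and graph are invented.

```python
def descendants(outputs, roots, max_depth=None):
    """Return the roots plus everything reachable along output edges.

    outputs maps an identifier to the identifiers derived from it.
    max_depth limits the number of hops from any root (None = unlimited).
    """
    found = dict.fromkeys(roots)   # one shared, insertion-ordered visited set
    frontier = list(found)
    depth = 0
    while frontier and (max_depth is None or depth < max_depth):
        next_frontier = []
        for node in frontier:
            for child in outputs.get(node, ()):
                if child not in found:
                    found[child] = None
                    next_frontier.append(child)
        frontier = next_frontier
        depth += 1
    return list(found)


outputs = {
    "warp1": ["reslice1"],
    "warp2": ["reslice2"],
    "reslice1": ["softmean"],
    "reslice2": ["softmean"],      # reachable from both roots
    "softmean": ["atlas.img"],
}
print(descendants(outputs, ["warp1", "warp2"]))
print(descendants(outputs, ["warp1", "warp2"], max_depth=1))
```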

Results:

940.0 [passfile; challenge/q7/atlas.img] version 1
    name: challenge/q7/atlas.img

917.0 [passfile; challenge/atlas.img] version 1
    name: challenge/atlas.img

Query 7:

A user has run the workflow twice, in the second instance replacing each procedure (convert) in the final stage with two procedures: pgmtoppm, then pnmtojpeg. Find the differences between the two workflow runs. The exact level of detail in the difference that is detected by a system is up to each participant.

While we have talked about special-purpose diff queries, we haven't yet implemented them. So we use textual diff on a pair of reports.

The output this generates contains some noise but is more useful than one might first think. It could be made better still if nq offered more control over the precise output format. Not printing the pnode and subobject numbers of non-PASS files (which do not mean a great deal) would remove most of the noise.

(To quantify that a little, each of the reports is about 110,000 lines, and the diff is about 11,000 lines.)

As mentioned above, the modified workload was run in a subdirectory, on the same original input files, with the pathnames in the workload script updated accordingly.

The workload script was edited online using emacs; the emacs process can be seen in the provenance.

nq 'ancestors atlas-x.gif report' > q7-a.tmp
nq 'ancestors q7/atlas-x.jpg report' > q7-b.tmp
diff -u q7-a.tmp q7-b.tmp
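
Since nq does not yet offer control over the report format, some of the noise could be filtered out before diffing. The following is only a sketch: strip_idents is a hypothetical helper, and it assumes that every record header has the form "pnode.subobject [type; name] version N", as the passfile headers shown on this page do.

```shell
# Strip the leading "pnode.subobject " ident from every record header that is
# not a passfile, so that differently numbered non-PASS objects no longer
# show up as differences. Header layout is an assumption (see above).
strip_idents() {
    sed -E '/^[0-9]+\.[0-9]+ \[passfile;/!s/^[0-9]+\.[0-9]+ \[/[/'
}
```

One would then run each report through the filter, for example strip_idents < q7-a.tmp > q7-a.flt and likewise for q7-b.tmp, and diff the filtered files.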

Results:

Query 8:

A user has annotated some anatomy images with a key-value pair center=UChicago. Find the outputs of align_warp where the inputs are annotated with center=UChicago.

To make this possible, after running the workload we added this annotation to two of the input files and not the others.

Note that we added the annotation after the workload on purpose; the file is still the same version, and we want to be able to annotate files after the fact and still search on the annotations. (This is useful, for example, for marking files bad.)

This ends up being another multi-stage query similar to Q6:

INPUTS=`nq 'select ident from everything where $center == "UChicago" table'`
WARPS=`nq '
    select ident from descendents { '"$INPUTS"' } depth 1
    where type == "proc" && basename == "align_warp"
    table'`
nq 'descendents { '"$WARPS"' } anchor type == "passfile" where type == "passfile" report'

The '$' allows referring to a user annotation. This prevents name conflicts between user annotations and built-in fields.

We begin by finding the annotated files, then find their immediate (depth 1) descendent processes of a suitable form; then we find specifically the output files from those processes - the anchor stops the search at the first file down, and the type restriction prevents anything in between from appearing. This form allows for possible pipes and output filter processes in between, and is included here for demonstration purposes.

Output:


930.0 [passfile; challenge/q7/warp3.warp] version 1
    type: passfile
    name: challenge/q7/warp3.warp

929.0 [passfile; challenge/q7/warp2.warp] version 1
    type: passfile
    name: challenge/q7/warp2.warp

907.0 [passfile; challenge/warp3.warp] version 1
    type: passfile
    name: challenge/warp3.warp

906.0 [passfile; challenge/warp2.warp] version 1
    type: passfile
    name: challenge/warp2.warp

Note that it found the descendents from both workload runs.

Query 9:

A user has annotated some atlas graphics with key-value pair where the key is studyModality. Find all the graphical atlas sets that have metadata annotation studyModality with values speech, visual or audio, and return all other annotations to these files.

To make this possible we annotated all six output images (from both the regular and Q7-variant workloads), one set with study modality "mindreading" (why? why not?) and the other with study modality "visual", and some other annotations too, which you will see in the output.

The query:

nq 'select annotations from everything
        where (basename ~ "atlas*.gif" || basename ~ "atlas*.jpg") && (
            $studyModality == "speech" ||
            $studyModality == "visual" ||
            $studyModality == "audio")
        report
'

and results:

947.0 [passfile; challenge/q7/atlas-z.jpg] version 1
    annotation: dim=z
    annotation: run=q7
    annotation: studyModality=visual

946.0 [passfile; challenge/q7/atlas-y.jpg] version 1
    annotation: dim=y
    annotation: run=q7
    annotation: studyModality=visual

945.0 [passfile; challenge/q7/atlas-x.jpg] version 1
    annotation: dim=x
    annotation: run=q7
    annotation: studyModality=visual
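
For further processing in the shell, the report output can be flattened into one line per annotation. This is a sketch only: flatten_annotations is a hypothetical helper, it assumes the record layout shown in the results above, and it would truncate annotation values that contain spaces.

```shell
# Flatten nq "report" records into one line per annotation: "<name> <key>=<value>".
flatten_annotations() {
    awk '
        /^[0-9]+\.[0-9]+ \[/ {
            # Header line: the object name sits between "; " and "]".
            name = $0
            sub(/^[^;]*; /, "", name)
            sub(/\].*$/, "", name)
            next
        }
        # Indented annotation line: print it next to the current object name.
        $1 == "annotation:" { print name, $2 }
    '
}
```

Piping the report from the query above through flatten_annotations would give lines such as "challenge/q7/atlas-z.jpg dim=z", which are convenient for sort or grep.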

Suggested Workflow Variants

Our system was intended to support shell pipelines, so variants that use more pipes and fewer intermediate files would exercise that capability. This is particularly true of pipes with multiple writers, such as the shell construct

(
    echo '-------- File 1 --------'
    cat somefile
    echo '-------- File 2 --------'
    cat otherfile
    echo '-------- Diffs --------'
    diff somefile otherfile
) | processing-script > outputfile
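
To try the construct above without any particular inputs, a self-contained variant might look like the following sketch. The temporary files and the use of cat -n in place of processing-script are placeholders chosen for illustration. The subshell and its cat and diff children all write into the same pipe.

```shell
# Create two small input files in a scratch directory (placeholder contents).
tmpdir=$(mktemp -d)
printf 'alpha\nbeta\n'  > "$tmpdir/somefile"
printf 'alpha\ngamma\n' > "$tmpdir/otherfile"

(
    echo '-------- File 1 --------'
    cat "$tmpdir/somefile"
    echo '-------- File 2 --------'
    cat "$tmpdir/otherfile"
    echo '-------- Diffs --------'
    # diff exits non-zero when the files differ; that is expected here.
    diff "$tmpdir/somefile" "$tmpdir/otherfile"
) | cat -n > "$tmpdir/outputfile"
```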

Suggested Queries

Here are some queries not in the list that we can support:

With nq one would do this by first finding the slicer process:

nq 'select ident from ancestors atlas-x.gif where type == "proc" && basename == "slicer" table'

then search the ancestry of that process for a file in the FSL source distribution. Or for a file named "fakeslicer":

nq 'select name from ancestors x.y where type == "passfile" && basename == "fakeslicer" table'

One can do this by searching the ancestors for a suitable matching file and inspecting whether the output is empty:

nq 'ancestors atlas-z.gif where basename == "atlas-x.gif" report'
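
In a script, the "is the output empty" check can be wrapped in a small helper. This is a sketch; has_output is a hypothetical name, and it relies on the report being empty when nothing matches, as described above.

```shell
# Succeeds if the given command prints anything on standard output.
has_output() {
    [ -n "$("$@")" ]
}

# Intended use (requires nq, so it is left commented out here):
# if has_output nq 'ancestors atlas-z.gif where basename == "atlas-x.gif" report'; then
#     echo "a matching ancestor exists"
# fi
```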

nq 'ancestors atlas-x.gif where type == "proc" && basename ~ "*cc" && concat(argv) ~ "*-ffast-math*" report'

(-ffast-math enables floating-point optimizations that relax IEEE 754 semantics, which is rarely what you want in a numerical program.)

(This is the same as Q1, possibly with some additional limits on the search; it's enabled because we don't require workloads to be formalized in advance.)

Our goal was to be able to generate the workflow itself directly as a runnable shell script; this has proven difficult.

Some queries not in the list that we cannot currently support, but expect to be able to in the future:

This is a slightly different kind of recursive search, easily supportable with our current databases but not implemented.

This is a form of ancestry query, reporting only the objects that no longer exist or have been changed. nq does not do this, although an earlier version of it did, after a fashion. The additional questions in parentheses are hard.

We cannot do this now because the core parts of the OS cannot be put on a PASS volume. This technical limitation is expected to be lifted in PASS v2. Otherwise it's a simple ancestry query, perhaps something of the form

LIBS=`nq 'select ident from ancestors align_warp where basename ~ "libm.so*" table'`
nq 'select name from ancestors { '"$LIBS"' } where basename ~ "*.rpm" report'

Categorisation of queries

According to your provenance approach, you may be able to provide a categorisation of queries. Can you elaborate on the categorisation and its rationale.

Live systems

We have a machine up (or at least sometimes up) running the PASS v1 prototype; contact us for login access.

Also, if you're adventurous (and comfortable building Linux kernels) you can run it yourself; contact us for a copy. Note that it's not really suitable for production use.

Further Comments

Provide here further comments.

Conclusions

Provide here your conclusions on the challenge, and issues that you like to see discussed at a face to face meeting.

-- PassProject - 08 Sep 2006 -- PassProject - 13 Sep 2006

Copyright © 1999-2012 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.