Crystallography Workflow
Outline
This page outlines the crystallography workflow for the Fourth and Last Provenance Challenge. It describes the workflow and how it maps to the abstract scenario. This page also provides all the information and data needed to run the workflow and take part in the challenge.
Brief Summary
Crystallography is the experimental science of determining the arrangement of atoms in solids. Crystallographic methods depend on the analysis of the diffraction patterns that emerge from a crystal sample that is targeted by X-ray beams. In the scenario described below, scientists perform a series of steps to produce a set of atom coordinates from a crystal, and then publish them in a public database. The raw data and conduct of an experiment which produced a crystal image are important for others to interpret the quality of that image.
This experiment is one performed by crystallographers at King's College London. The process abstracts away from the details but has been confirmed to be realistic, and the provenance questions are ones which have been confirmed as valuable to answer.
Example Workflow
In this section, we first show the general workflow for performing a crystallography experiment, then specify a single example of this workflow being executed.
Full size link
In the image above, the red outlined boxes around areas of the workflow mark the areas which are mapped from the abstract scenario. Each has its name to the left of the box. The light blue labels attached to each element of the workflow help to explain the scenario as described below.
We will now describe a single instance of the experiment. Here, the crystallographer had two crystal samples of the same protein and wanted to work out the structure of the molecule using X-ray diffraction techniques.
This section is split up into stages purely for clarification of explanation, i.e. no other semantics should be attached to the division. The letters in brackets refer to artifacts and processes in the general workflow above.
First Stage
- 1. Three configuration parameters (wavelength, beamline angle, polarisation) were selected and the crystal(A) was mounted in the synchrotron.
- 2. We collected three diffraction images(C) using the synchrotron(B).
- 3. The three images(C) were then put through Mosflm to visualise them(D).
- 4. The images(E) were then inspected by our crystallographer(F), who decided that they were not good enough and that the crystal was not of adequate quality.
- 5. We then replaced the crystal(A) with the other sample and began the procedure again, returning to the start of the experiment.
- 5.1. The wavelength and beamline were set up exactly as they had been for the previous sample.
- 5.2. We then collected three more diffraction images from the new sample.
- 5.3. We visualised these images in Mosflm(D).
- 5.4. The images were then inspected(F) and it was decided that they were of adequate quality to continue.
- 6. We still required more diffraction images(C), using a variety of beamline angles, to get a comprehensive reflection file, so we returned to the start of the loop.
- 6.1. We then proceeded to take more diffraction images(C), 360 in total.
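The control flow of the first stage can be sketched as a loop over the available crystal samples. This is only an illustration of the decision structure described above: the collect and inspect callables, the sample names and the file names are hypothetical stand-ins for the synchrotron and the crystallographer, and we assume that the full data set is collected in one further pass once a sample has been accepted.

```python
from typing import Callable, List, Optional, Sequence, Tuple


def first_stage(
    crystals: Sequence[str],
    collect: Callable[[str, int], List[str]],
    inspect: Callable[[List[str]], bool],
    test_images: int = 3,
    full_images: int = 360,
) -> Optional[Tuple[str, List[str]]]:
    """Return (crystal, images) for the first crystal whose test images pass inspection."""
    for crystal in crystals:
        # Mount the crystal, take a few test images and visualise them.
        trial = collect(crystal, test_images)
        # User decision point: are the images (and the crystal) good enough?
        if not inspect(trial):
            continue  # swap in the next sample and start again
        # Accepted: go round the loop again to gather the full data set.
        return crystal, collect(crystal, full_images)
    return None  # no sample was of adequate quality


# Hypothetical stand-ins, for illustration only.
def fake_collect(crystal: str, count: int) -> List[str]:
    return [f"{crystal}_{i:03d}.img" for i in range(1, count + 1)]


def fake_inspect(images: List[str]) -> bool:
    return images[0].startswith("sample2")


result = first_stage(["sample1", "sample2"], fake_collect, fake_inspect)
```

With these stand-ins the first sample is rejected and the second is accepted, which mirrors the instance described above.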
Second Stage
- 7. Mosflm was then used to generate an unmerged reflection file(H) from the diffraction images(C) we had just obtained(G).
- 8. This file was then transformed(I) into a merged reflection file(J).
- 8.1. Before it could be merged, we needed to rearrange it, as unmerged files are given in no particular order.
- 8.2. We then merged the file using SCALA, which is part of the CCP4 suite.
Third Stage
- 9. A coordinates file(M) was then generated(L) using the merged reflection file(J) through CCP4 and refinement statistics.
- 9.1. We used the PDB search tool to identify the family of the protein from the sequence data(K). This search returns a PDB file for a closely comparable protein.
- 9.2. This is used to identify some of the structure of the protein. The PDB file is then used as the basis of the coordinates file of the protein we are trying to solve.
- 10. Then the coordinates file(M) was visualised using Coot(N) and interactive changes were made to the sequence.
- 10.1. Input into Coot: the coordinates and the reflection file refined in the previous step.
- 10.2. Statistics of good fit generated from the previous step (includes weights).
- 11. We then checked our new coordinates(Q) and the statistics from above. They were not accurate enough, so we ran them through Coot four more times, making changes until they had been sufficiently improved. We now had our final coordinates file(R).
Fourth Stage
- 12. At this point we had now solved the structure of the protein, but we were required to add extra details(S) to the coordinates file(R) before the data could be submitted to the PDB.
- 13. The protein(T) is now submitted to the PDB(U) and a web reference is produced(V).
- 14. We create a wiki page to discuss the results(W) and create a report based on them(X).
- 15. A paper(X) is produced via the wiki page discussion, which cites the data published on the PDB.
- 16. We review the paper along with the other contributors(Y). They are not happy with it in its current state, so we return to the wiki(W) and discuss the changes to be made. A new version(Z) is then created which everyone is happy with.
Fifth Stage
- 17. We submit(AA) the paper to the biochemistry journal and it is published(AB).
- 18. A user finds the paper via a query(AC) on the biochemistry journal website and retrieves the URL(AD).
- 19. The user then tweets(AE) his thoughts on the paper and, based on tweets about the paper, the collaborative editing stage(W) is revisited.
The diagram below shows the cardinality relationships between the different data objects produced by running the workflow and how they are related to each other. It is provided in addition to the workflow to aid understanding of what is produced at each stage.
Sample OPM Graph
At the link below, we show a hand-made, mocked-up OPM graph to describe the above experiment. This should be used for further clarification as to the structure of the experiment.
We intend to refine this graph, in particular to show accounts of differing granularities.
Sample OPM Graph
Mapping from Abstract Scenario
Below, we map from each step in the abstract scenario to the procedures which comprise the crystallography workflow.
User Performs Action
This is mapped to the beginning of the workflow(A,B) because at the beginning of the experiment the crystallographer must choose which crystal they will use, mount it on the synchrotron and get some diffraction images.
User Decision Point 1
The first user decision point is mapped to inspect images(E): after inspecting the images, the crystallographer must decide whether the images are good enough and whether the crystal is still OK to use and, if not, repeat the first part of the experiment.
Exchange Between Services
The exchange between services happens between Mosflm and the detector(C,D,E).
Collection Manipulation
This is collection manipulation because the MTZ files are collections of data about spots. They start off as an unmerged file(I), which contains data about the same spot multiple times. This file then needs to be rearranged and merged to create a new file(J) which contains only one occurrence of each spot.
Running Service with Others' Data
This is mapped to this area of the workflow(L) because, when the reflection data is processed to obtain coordinates and the statistical data used to tell whether they are a good match, it is compared to the structure of similar proteins from the PDB. These structures are retrieved from the PDB using a search tool which matches the sequence data of proteins(K). The result is then processed using services provided by CCP4.
Workflow
The workflow is mapped to this section because this part is a linear workflow(M,N,P) which will be automated for our process; there are no particular areas of interest within it from a provenance point of view. It produces an image of the protein in its current form.
User Decision Point 2
This is 'check satisfaction of coordinates'(Q): within this step the user must decide whether the coordinates(P) generated through Coot are accurate enough and, if they are not, go back and change them.
Publish Data To URL
The data is submitted to the PDB(U), which means that it then has a web reference.
Collaborative Editing
This is mapped to the section involving the wiki(W,X,Y), as the wiki is used for discussion and editing of the report by multiple collaborators.
Citing Paper
The data which is in the PDB is now cited in the report(Z) produced by collaboration on the wiki.
Credentials
Submitting to the biochemistry journal(AA,AB) fits the credentials section because, to submit to the journal, you need to be logged in to the appropriate sort of account.
Discovery By Query
Anyone is able to search the biochemistry journal(AC) through the search link on the main page, so the user can query for information about the protein solved in the scenario and discover the report(AD).
Social Collaboration
Social collaboration in the crystallography workflow is done through the use of Twitter(AE) to make comments and suggestions about the paper.
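For teams that want to work with this mapping programmatically, it can be held as a simple lookup table. The sketch below only restates the labels cited in the descriptions above; the dictionary name and the helper function are our own illustrative choices and are not part of the challenge material.

```python
# Abstract scenario step -> workflow labels cited in the mapping above.
SCENARIO_TO_WORKFLOW = {
    "User Performs Action": ["A", "B"],
    "User Decision Point 1": ["E"],
    "Exchange Between Services": ["C", "D", "E"],
    "Collection Manipulation": ["I", "J"],
    "Running Service with Others' Data": ["K", "L"],
    "Workflow": ["M", "N", "P"],
    "User Decision Point 2": ["Q"],
    "Publish Data To URL": ["U"],
    "Collaborative Editing": ["W", "X", "Y"],
    "Citing Paper": ["Z"],
    "Credentials": ["AA", "AB"],
    "Discovery By Query": ["AC", "AD"],
    "Social Collaboration": ["AE"],
}


def scenario_steps_for(label):
    """Return the abstract scenario steps whose mapping cites the given workflow label."""
    return sorted(
        step for step, labels in SCENARIO_TO_WORKFLOW.items() if label in labels
    )


steps_for_e = scenario_steps_for("E")
```

The reverse lookup shows, for example, that label E is cited by both the first decision point and the exchange between services.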
File Formats used
Below, we describe the key data formats used and/or produced in the experiment.
Diffraction File Formats
Diffraction files come in many different formats, which are decided by the manufacturer. Some examples of these file types are
*.img; *.mar; *.mccd; *.image; *.sfrm; *.osc.
For our workflow, only one will be used to simplify things: *.img
The header metadata may be viewed if an image is opened in a text editor. Software such as Mosflm and HKL2000 reads the image header and automatically extracts the metadata when images are loaded. The software uses the metadata to determine the experimental settings as part of the image integration and spot finding process.
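As a rough illustration of reading such a header outside the crystallography software, the sketch below parses a text header from the start of an image file. It assumes a brace-delimited block of KEY=VALUE; lines at the beginning of the file. Header layouts differ between detector manufacturers, so this layout and the key names in the sample are assumptions that should be checked against your own images.

```python
def parse_text_header(raw: bytes) -> dict:
    """Parse 'KEY=VALUE;' pairs found between the first '{' and the next '}'."""
    text = raw.decode("ascii", errors="replace")
    start = text.find("{")
    end = text.find("}", start + 1)
    if start == -1 or end == -1:
        raise ValueError("no brace-delimited text header found")
    header = {}
    for line in text[start + 1:end].splitlines():
        line = line.strip().rstrip(";")
        if "=" in line:
            key, value = line.split("=", 1)
            header[key.strip()] = value.strip()
    return header


# Made-up header followed by a few bytes standing in for the binary image data.
sample = b"{\nHEADER_BYTES=  512;\nWAVELENGTH=0.9795;\nPHI=45.000;\n}\n" + b"\x00" * 16
metadata = parse_text_header(sample)
```

For a real file, the same function could be applied to the first few kilobytes read in binary mode.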
Visualisation Software
We use two types of visualisation software in the workflow. One is to visualise the diffraction images and the other to visualise the final crystal images.
For the workflow only Mosflm will be used for the visualisation of diffraction images to simplify the process.
Reflection files are stored as *.MTZ files.
MTZ is a binary flat-file format containing reflection data and a header of metadata.
An MTZ file can be opened with the CCP4 viewer program, which displays its contents as plain text.
These files come in two forms, merged and unmerged.
An unmerged reflection file consists of one big file which can be considered to be a collection of 'spots', where there will be multiple occurrences of the same spot within the unmerged file.
Spots are represented as a position and intensity, where spots of the same position can have different intensities.
Each spot carries a reference to the diffraction image from which it was extracted.
A merged file is still a collection of spots, but each spot appears only once and its intensity is the average of the intensities of every appearance of that spot in the unmerged file. This is done using SCALA from CCP4.
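The relationship between the two forms can be illustrated with a small sketch. It is not how SCALA works internally and it does not read MTZ files; it only shows the averaging described above. The Spot class, the use of an integer triple as the position, and the sample values are assumptions made for the example.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List, Tuple

Position = Tuple[int, int, int]


@dataclass(frozen=True)
class Spot:
    position: Position   # where the spot lies (assumed to be an integer triple)
    intensity: float     # measured intensity of this occurrence
    image: str           # diffraction image the spot was extracted from


def merge_spots(unmerged: List[Spot]) -> Dict[Position, float]:
    """Average the intensities of all occurrences of each position."""
    grouped = defaultdict(list)
    for spot in unmerged:
        grouped[spot.position].append(spot.intensity)
    # Sorting mirrors the rearrangement step: unmerged spots arrive in no particular order.
    return {position: mean(values) for position, values in sorted(grouped.items())}


unmerged = [
    Spot((1, 0, 0), 100.0, "img_001.img"),
    Spot((0, 1, 2), 40.0, "img_002.img"),
    Spot((1, 0, 0), 110.0, "img_003.img"),
]
merged = merge_spots(unmerged)
```

Note that the merged result no longer carries the image references, which is one reason the unmerged file matters for provenance.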
For the visualisation of the final crystal image we will use coot as it can read a number of reflection file formats and coordinate data.
CCP4 Scripts
Sequence files
Text files consisting of letters that represent a sequence of amino acids and describe how they are chained together to build a protein.
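A minimal sketch of reading such a file is given below. It assumes the standard one-letter amino acid codes and that any line beginning with ">" is a description line to be skipped; the exact layout of the sequence files used in the workflow is not specified here, so treat both points as assumptions. The sample sequence is made up.

```python
# Standard one-letter to three-letter amino acid codes.
THREE_LETTER = {
    "A": "ALA", "R": "ARG", "N": "ASN", "D": "ASP", "C": "CYS",
    "Q": "GLN", "E": "GLU", "G": "GLY", "H": "HIS", "I": "ILE",
    "L": "LEU", "K": "LYS", "M": "MET", "F": "PHE", "P": "PRO",
    "S": "SER", "T": "THR", "W": "TRP", "Y": "TYR", "V": "VAL",
}


def read_sequence(text):
    """Return the chain of residues, in order, as three-letter codes."""
    residues = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(">"):
            continue  # skip blank lines and description lines
        # Unknown letters raise KeyError rather than being silently dropped.
        residues.extend(THREE_LETTER[letter] for letter in line.upper())
    return residues


example = ">hypothetical protein fragment\nMKV\nLA\n"
chain = read_sequence(example)
```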
Coordinate Files
The coordinate data is held in the PDB file and is the geometry that defines how the different atoms are linked together in the protein molecule. These are line-separated plain text files.
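Because the files are plain text with fixed-width columns, the atom coordinates can be read with simple string slicing. The sketch below follows the column positions of the ATOM and HETATM records in the PDB file format; the sample record is built in code with made-up values so that its columns are guaranteed to line up, and the function name is our own.

```python
def parse_atoms(text):
    """Extract atom records from PDB-format text using the fixed column layout."""
    atoms = []
    for line in text.splitlines():
        if line.startswith(("ATOM  ", "HETATM")):
            atoms.append({
                "serial": int(line[6:11]),          # columns 7-11
                "name": line[12:16].strip(),        # columns 13-16
                "residue": line[17:20].strip(),     # columns 18-20
                "chain": line[21],                  # column 22
                "residue_number": int(line[22:26]), # columns 23-26
                "xyz": (
                    float(line[30:38]),             # columns 31-38
                    float(line[38:46]),             # columns 39-46
                    float(line[46:54]),             # columns 47-54
                ),
            })
    return atoms


# A made-up ATOM record, formatted field by field to respect the column widths.
SAMPLE = "{:<6}{:>5} {:<4} {:>3} {}{:>4}    {:8.3f}{:8.3f}{:8.3f}".format(
    "ATOM", 1, " N", "MET", "A", 1, 27.340, 24.430, 2.614
)
atoms = parse_atoms(SAMPLE + "\nEND\n")
```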
Executables
The packages listed below are required to run the entire Crystallography workflow. Some of these may not be required depending on which area of the workflow is being worked on by a particular challenge team.
CCP4
This can be found at
http://www.ccp4.ac.uk/
It is required for the part of the workflow which specifies its use, but is also required to run Mosflm in the first section of the workflow.
CCP4 runs on Windows, Linux (generic x86 version or specific RedHat EL4 version) and Macintosh.
The source code is also available.
It is simple to install, as there is a form on the website which will generate an archive containing all the packages required with instructions on how to install.
This should be installed before Mosflm to avoid any problems.
Mosflm
Mosflm can be found at
http://www.mrc-lmb.cam.ac.uk/harry/mosflm/
This is required for the first section of the workflow. Mosflm comes in two versions; we will be using the command line version.
Once CCP4 is installed, Mosflm can simply be run. It works on Linux, OSX and Windows.
Coot
http://www.biop.ox.ac.uk/coot/
Builds exist for Linux and Windows. The source code is also freely available on the website.
Coot provides an interactive visualisation of coordinate files and outputs a new set of coordinates once manipulation is completed. Installing Coot is easy: download the file, untar it, and then add the specified folder to your PATH environment variable.
Execution Instructions
In this section, we will add more detailed instructions about executing the steps in the experiment yourself. Please refer to the section below for the input data used for each step.
Step 1: Visualise Images in Mosflm 1st Iteration
This requires Mosflm to have been installed.
Run
ipmosflm < commands.txt
in a directory containing commands.txt and the three image files listed for this step below.
Alternatively, to perform this interactively, run
ipmosflm
in the directory containing the images and, at the prompt, enter each command listed in commands.txt.
IMPORTANT NOTE: This works on Linux but does not appear to work on Windows (the image files produced are empty), possibly due to a bug in the Windows version of Mosflm. We have contacted the authors of Mosflm to check why this might be.
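If you want to drive this step from a script, for example to record when and how it was run, the redirection above can be reproduced with Python's subprocess module. This is a minimal sketch: the function name and the step1 directory in the commented example are hypothetical, and the output is captured rather than interpreted.

```python
import subprocess
from pathlib import Path


def run_with_command_file(executable, command_file, workdir):
    """Run `executable < command_file` inside workdir and return the completed process."""
    workdir = Path(workdir)
    with open(workdir / command_file, "rb") as commands:
        return subprocess.run(
            [executable],
            stdin=commands,            # same effect as the shell's `< commands.txt`
            cwd=workdir,               # the directory holding the image files
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # keep messages and errors in one stream
            check=False,
        )


# Example call (requires Mosflm to be installed and the step 1 files in ./step1):
# result = run_with_command_file("ipmosflm", "commands.txt", "step1")
# print(result.stdout.decode(errors="replace"))
```

The captured output and return code can then be stored alongside the generated files as part of the provenance record.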
Data
Below, we provide sample data for the experiment instance as described above.
Provenance Questions
NB: We wish to cover the range of questions proposed at the challenge workshop in Troy. The list is still under development, and suggestions are welcome.
We wish to ask the following questions about the provenance of a crystal image.
Question 1: It is 10 years after the process was conducted, and the process has become obsolete. For a given published crystal image (named by web reference), what were the raw diffraction images from which the crystal image was produced? Assume that the public database can contain only the coordinate and reflection files, that the data kept on the desktop PC which ran the process has gone, and that the knowledge in people's heads has been forgotten.
??
Question 2: For a given crystal, how often did a crystallographer reject and reproduce coordinates (the later stages of the experiment)? This is important because difficulty in obtaining an adequate crystal image can indicate that the original diffraction data was poor quality.
In our experiment this process was repeated 4 times.
Question 3: During the experiment the first loop is executed more than once. This could be done for a number of reasons. Was it to retrieve more images to continue the experiment or to try a new crystal?
The first iteration was to try a new crystal.
The second iteration was to gather more images to continue the experiment.
Question 4: Sequence data is used to help solve the structure of the protein by finding a similar protein which has been solved and comparing the sequences to see how they differ. Where this is done, what is the similar protein used and where was it sourced?
??
Question 5: The report has been published but how many times has it been edited before being published?
The report was edited 15 times?
Question 6: The identity and characteristics of the synchrotron may be lost over time if it is decommissioned. For a given PDB entry, what was the synchrotron used and what were the configurations for the experiment (polarisation, wavelength and beamline angle)?
Question 7: How many times has this data been cited in other reports?
4
Question 8: Who made the most edits to the report?
??
Question 9: After the report is published, it is discussed on Twitter and edited again. How many times has the paper been edited as a result of discussion on Twitter?
4?
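Questions such as Question 1 amount to walking backwards through the recorded provenance. The sketch below shows one way this could be done over a toy set of records. The records are hand-written and hypothetical (they are not the sample data for this challenge), each entry lists what an artifact was generated from, and the type names are taken from the vocabulary on this page.

```python
from collections import deque

# Hypothetical records: artifact -> (type, artifacts it was generated from).
PROVENANCE = {
    "pdb_entry": ("Webpage", ["final_coordinates"]),
    "final_coordinates": ("Coordinates_File", ["merged_reflections", "similar_protein"]),
    "similar_protein": ("PDB_File", []),
    "merged_reflections": ("Reflection_File", ["unmerged_reflections"]),
    "unmerged_reflections": ("Reflection_File", ["image_001", "image_002"]),
    "image_001": ("Diffraction_Image", []),
    "image_002": ("Diffraction_Image", []),
}


def ancestors_of_type(start, wanted_type, graph=PROVENANCE):
    """Breadth-first walk back from start, collecting ancestors of the wanted type."""
    seen = {start}
    queue = deque([start])
    found = []
    while queue:
        node = queue.popleft()
        node_type, sources = graph[node]
        if node_type == wanted_type and node != start:
            found.append(node)
        for source in sources:
            if source not in seen:
                seen.add(source)
                queue.append(source)
    return sorted(found)


raw_images = ancestors_of_type("pdb_entry", "Diffraction_Image")
```

The same traversal with a different wanted type, for example PDB_File, would address the sourcing part of Question 4, provided the relevant links were recorded.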
Vocabulary and Ontology
In order to provide consistency in identifying types and concepts in provenance traces, we have prepared a common vocabulary of URIs. They are constructed as follows. The base URI for all the terms we define below is
http://twiki.ipaw.info/bin/viewauth/Challenge/Crystallography/Crystallography.owl
The set of terms is as follows:
- #Amino_Acid_Sequence
- #Binary_File
- #Bioinformatics_Journal
- #CCP4_Suite
- #Collaborative_Editing
- #Collection
- #Coordinates_File
- #Coot
- #Crystal_Sample
- #Crystallographer
- #Diffraction_Image
- #File
- #Journal
- #Mosflm
- #PDB
- #PDB_File
- #Paper
- #Publication
- #Refinement_statistics
- #Reflection_File
- #Scala
- #Scientist
- #Serivce
- #Social_Collaboration
- #Synchrotron
- #Text_File
- #Third_Party
- #Truncate
- #Tweet
- #Twitter
- #User
- #Webpage
- #WikiPage
- #hasEdited
- #wasApprovedBy
- #wasEditedBy
- #wasGeneratedFrom
- #wasInspectedBy
- #wasProducedBy
- #wasPublishedIn
- #wasPublishedTo
- #wasSubmittedTo
- #wasUsedIn
- #WebsiteURL
- #hasBeamlineAngle
- #hasPolarisation
- #hasWavelength
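As a small illustration of the construction, a full URI is obtained by appending a term from the list to the base URI. The helper name below is our own, and the example statement simply combines three terms from the list.

```python
BASE_URI = "http://twiki.ipaw.info/bin/viewauth/Challenge/Crystallography/Crystallography.owl"


def term_uri(fragment):
    """Return the full URI for a vocabulary term, given with or without its leading '#'."""
    if not fragment.startswith("#"):
        fragment = "#" + fragment
    return BASE_URI + fragment


# Example: a statement that a coordinates file was generated from a reflection file.
statement = (
    term_uri("Coordinates_File"),
    term_uri("#wasGeneratedFrom"),
    term_uri("Reflection_File"),
)
```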
We have also prepared a tentative OWL file mapping concepts specific to the crystallography workflow to those in the abstract scenario. It is available here.
--
CarlBarton - 23 Jun 2010