Introduction

The ability to identify conserved waters from a collection of related protein structures is important for gaining a better understanding of the ligand binding environment. The vanddraabe package is based on the work of Sanschagrin and Kuhn (Protein Science, 1998, 7 (10), pp 2054-2064. DOI: 10.1002/pro.5560071002) and Patel, Gruning, Gunther, and Merfort (Bioinformatics, 2014, 30 (20), pp 2978-2980. DOI: 10.1093/bioinformatics/btu424). Expanding on WatCH and PyWATER, vanddraabe returns statistical parameters for each water cluster, informative graphs, and a PyMOL session file to visually explore the conserved waters and protein along with intermediate information.

This vignette demonstrates the steps and thought process used to analyze the conserved waters of ten crystallographically determined thrombin structures and is based on the work of Sanschagrin and Kuhn (Protein Science, 1998, 7 (10), pp 2054-2064). The results presented herein are part of the original vanddraabe article. There are six main steps to determine conserved waters within a collection of protein structures:

Before identifying and analyzing conserved waters, several R packages need to be loaded in addition to vanddraabe. To aid consistency, a filename prefix also needs to be defined; this example uses "thrombin10". The files (PDB, Excel workbook, and PyMOL session files) and directories (folders) generated during this analysis will start with thrombin10, and the files will also include a date-time stamp to differentiate results.

library(vanddraabe)
library(bio3d)
library(reshape2)
library(ggplot2)
library(cowplot)

thrombin10.filename <- "thrombin10"

Download PDB structures

thrombin10.PDBids <- c("1hai", "1abj", "1ppb", "1tmb", "1hah",
                       "1tmt", "1abi", "1thr", "1ths", "1ihs")

thrombin10.PDBs <- get.pdb(ids=thrombin10.PDBids, split=FALSE, path="thrombin10_rawPDBs")

Determine structure quality

The quality of the structures impacts the results of the conserved water analysis. Often the resolution is used to define the quality of the structure, but Robserved and Rfree should also be taken into consideration. Smaller resolution values indicate greater confidence in the location of atoms. Protein structures with reported resolution values greater than or equal to 3.0 Angstroms illustrate only the basic contours of the protein chain, so the atomic structure of the backbone and sidechains is inferred. The Robserved value (also known as R-value Observed) indicates how well the “modeled” atoms of the protein structure match the electron density maps, with values of 0.20 or less being typical. The corresponding Rfree value measures how well the model agrees with a held-out set of reflections (typically 5-10% of the data) that was excluded from refinement; values of 0.26 or less are considered acceptable. In vanddraabe, structures are evaluated using any combination of the resolution, Robserved, and Rfree. Not all structures have Robserved and Rfree values reported. Only the resolution values are provided for the thrombin example presented here.

thrombin10.rcsbCLEANING <- getRCSBdata(prefix = "./thrombin10_rawPDBs",
                                       resolution = 3.0,
                                       rFree = NULL,
                                       rObserved = NULL,
                                       filename = thrombin10.filename)
## Please be patient... Getting PDB information from www.rcsb.org
## The R-observed cutoff is set to "NULL" and is not a factor in evaluating structures for removal.
## The R-free cutoff is set to "NULL" and is not a factor in evaluating structures for removal.
## 
## ----- getRCSBdata SUMMARY _____
## getRCSBdata is DONE!
## RCSB information for each PDB structure was written to the Excel workbook: thrombin10_DATA_RESULTS.xlsx
## All structures (10) PASSED the structure evaluation requirements and were copied to the "thrombin10_RCSB_passed" folder.

PDB structures with values greater than the provided cutoffs are removed from further analysis. To exclude a criterion from the structure evaluation, set its cutoff to NULL. If no resolution value is provided, getRCSBdata will automatically use 3.0. The following information is returned:

All ten thrombin structures passed the 3.0 Angstrom resolution requirement. The following table contains some structural information for the thrombin structures.

Portion of the RCSB Structural Information
PDBid  chainId  resolution  experimentalTechnique  source        citation                                        depositionDate
1ABI   H,I,L    2.3         X-RAY DIFFRACTION      Homo sapiens  Qiu et al. Biochemistry (1992)                  1992-08-24
1ABJ   H,L      2.4         X-RAY DIFFRACTION      Homo sapiens  Qiu et al. Biochemistry (1992)                  1992-08-24
1HAH   H,I,L    2.3         X-RAY DIFFRACTION      Homo sapiens  Vijayalakshmi et al. Protein Sci. (1994)        1994-06-27
1HAI   H,L      2.4         X-RAY DIFFRACTION      Homo sapiens  Vijayalakshmi et al. Protein Sci. (1994)        1994-06-27
1IHS   H,I,L    2.0         X-RAY DIFFRACTION      Homo sapiens  Zdanov et al. Proteins (1993)                   1993-08-04
1PPB   H,L      1.92        X-RAY DIFFRACTION      Homo sapiens  Bode et al. EMBO J. (1989)                      1991-10-24
1THR   H,I,L    2.3         X-RAY DIFFRACTION      Homo sapiens  Qiu et al. J.Biol.Chem. (1993)                  1993-06-16
1THS   H,I,L    2.2         X-RAY DIFFRACTION      Homo sapiens  Qiu et al. J.Biol.Chem. (1993)                  1993-06-16
1TMB   H,I,L,T  2.3         X-RAY DIFFRACTION      Homo sapiens  Maryanoff et al. Proc.Natl.Acad.Sci.USA (1993)  1993-05-27
1TMT   H,I,J,L  2.2         X-RAY DIFFRACTION      Homo sapiens  Priestle et al. Protein Sci. (1993)             1994-05-26
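
The cutoff rule described above (a structure is removed when a value exceeds its cutoff, and a NULL cutoff disables that criterion) can be sketched in base R. This is only an illustration and not the internals of getRCSBdata: the helper passes_cutoff and the data frame structure.info are hypothetical names, and treating a missing value as a failure when a cutoff is supplied is an assumption.

```r
# Resolution values (Angstroms) copied from the table above.
# rObserved and rFree are left as NA because they are not used in this example.
structure.info <- data.frame(
  PDBid = c("1ABI", "1ABJ", "1HAH", "1HAI", "1IHS",
            "1PPB", "1THR", "1THS", "1TMB", "1TMT"),
  resolution = c(2.3, 2.4, 2.3, 2.4, 2.0, 1.92, 2.3, 2.2, 2.3, 2.2),
  rObserved = NA_real_,
  rFree = NA_real_,
  stringsAsFactors = FALSE
)

# Hypothetical helper: TRUE when a value does not exceed the cutoff.
# A NULL cutoff disables the criterion, so every structure passes it.
# Assumption: a missing value fails when a cutoff is supplied.
passes_cutoff <- function(values, cutoff) {
  if (is.null(cutoff)) {
    return(rep(TRUE, length(values)))
  }
  !is.na(values) & values <= cutoff
}

# Same settings as the getRCSBdata call: resolution 3.0, R-values ignored
keep <- passes_cutoff(structure.info$resolution, 3.0) &
  passes_cutoff(structure.info$rObserved, NULL) &
  passes_cutoff(structure.info$rFree, NULL)

passed.PDBids <- structure.info$PDBid[keep]
length(passed.PDBids)
```

With a 3.0 Angstrom cutoff every structure in the table is retained, which agrees with the summary message printed by getRCSBdata.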

Clean PDB structures

Protein structures from the RCSB commonly do not contain hydrogen atoms, but there is the rare occurrence of hydrogen atoms being added by the depositing authors. Often atoms will be modeled (added by the crystallographer and/or crystallographic software) when there is not enough electron density to resolve the atom. This is common when a portion of the amino acid residue is resolved and the “missing” portion of the residue is known from the protein’s sequence. Atoms are also removed when they are assigned a B-value or occupancy value outside the normal range. B-values have a range of 0 to 100 (0 being no variation in position and 100 being diffuse; values less than 40 are considered optimal) and occupancy values have a range of 0 to 1.0 (0 being no occupancy and 1.0 being present at that position in every copy of the molecule in the crystal; values greater than 0.90 are considered optimal).

PDB files obtained from the PDB conform to a specific set of formatting standards, but this does not mean the data within the PDB files are always correct. The CleanProteinStructures function cleans each PDB file and summarizes the atom evaluations. This function does the following (in this order):

thrombin10.CLEANED <- CleanProteinStructures(prefix = "./thrombin10_RCSB_passed",
                                             CleanHydrogenAtoms = TRUE,
                                             CleanModeledAtoms = TRUE,
                                             cutoff.prot.h2o.dist = 6.0,
                                             cleanDir = thrombin10.filename,
                                             filename = thrombin10.filename)
## Cleaning 1abi...
##   HEADER    HYDROLASE/HYDROLASE INHIBITOR           24-AUG-92   1ABI
##  - 241 of the 246 water oxygen atoms are within 6 Angstroms of the protein
##  - Wrote 1abi_cleaned.pdb to ./thrombin10_CLEANED
## Cleaning 1abj...
##   HEADER    HYDROLASE/HYDROLASE INHIBITOR           24-AUG-92   1ABJ
##  - 192 of the 196 water oxygen atoms are within 6 Angstroms of the protein
##  - Wrote 1abj_cleaned.pdb to ./thrombin10_CLEANED
## Cleaning 1hah...
##   HEADER    COMPLEX(SERINE PROTEINASE/INHIBITOR)    27-JUN-94   1HAH               
##    PDB has ALT records, taking A only, rm.alt=TRUE
##  - Removed modeled atoms
##  - 204 of the 204 water oxygen atoms are within 6 Angstroms of the protein
##  - Wrote 1hah_cleaned.pdb to ./thrombin10_CLEANED
## Cleaning 1hai...
##   HEADER    HYDROLASE/HYDROLASE INHIBITOR           27-JUN-94   1HAI               
##    PDB has ALT records, taking A only, rm.alt=TRUE
##  - Removed modeled atoms
##  - 194 of the 194 water oxygen atoms are within 6 Angstroms of the protein
##  - Wrote 1hai_cleaned.pdb to ./thrombin10_CLEANED
## Cleaning 1ihs...
##   HEADER    HYDROLASE/HYDROLASE INHIBITOR           04-AUG-93   1IHS
##  - 146 of the 146 water oxygen atoms are within 6 Angstroms of the protein
##  - Wrote 1ihs_cleaned.pdb to ./thrombin10_CLEANED
## Cleaning 1ppb...
##   HEADER    HYDROLASE/HYDROLASE INHIBITOR           24-OCT-91   1PPB
##  - 333 of the 402 water oxygen atoms are within 6 Angstroms of the protein
##  - Wrote 1ppb_cleaned.pdb to ./thrombin10_CLEANED
## Cleaning 1thr...
##   HEADER    HYDROLASE(SERINE PROTEINASE)            16-JUN-93   1THR
##  - Removed modeled atoms
##  - 190 of the 190 water oxygen atoms are within 6 Angstroms of the protein
##  - Wrote 1thr_cleaned.pdb to ./thrombin10_CLEANED
## Cleaning 1ths...
##   HEADER    HYDROLASE/HYDROLASE INHIBITOR           16-JUN-93   1THS
##  - 140 of the 140 water oxygen atoms are within 6 Angstroms of the protein
##  - Wrote 1ths_cleaned.pdb to ./thrombin10_CLEANED
## Cleaning 1tmb...
##   HEADER    HYDROLASE/HYDROLASE INHIBITOR           27-MAY-93   1TMB
##  - 229 of the 239 water oxygen atoms are within 6 Angstroms of the protein
##  - Wrote 1tmb_cleaned.pdb to ./thrombin10_CLEANED
## Cleaning 1tmt...
##   HEADER    HYDROLASE/HYDROLASE INHIBITOR           26-MAY-94   1TMT
##  - 111 of the 111 water oxygen atoms are within 6 Angstroms of the protein
##  - Wrote 1tmt_cleaned.pdb to ./thrombin10_CLEANED
## ----- Results written to Excel workbook _____
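
The "within 6 Angstroms of the protein" counts in the log above come from a distance filter controlled by cutoff.prot.h2o.dist. The following is a minimal base R sketch of such a filter on toy coordinates, not the code used by CleanProteinStructures; the helper nearest_protein_distance and the coordinates are hypothetical, and treating a water exactly at the cutoff as retained is an assumption.

```r
# Toy coordinates in Angstroms; rows are atoms, columns are x, y, z.
protein.xyz <- rbind(c(0.0, 0.0, 0.0),
                     c(1.5, 0.0, 0.0),
                     c(3.0, 0.0, 0.0))
water.xyz <- rbind(c(0.0, 2.8, 0.0),   # close to the first protein atom
                   c(3.0, 0.0, 5.9),   # just inside the cutoff
                   c(3.0, 0.0, 12.0))  # distant water

# Hypothetical helper: distance from each water oxygen to its
# nearest protein atom (brute-force Euclidean distances).
nearest_protein_distance <- function(water, protein) {
  apply(water, 1, function(w) {
    min(sqrt(rowSums(sweep(protein, 2, w)^2)))
  })
}

cutoff.prot.h2o.dist <- 6.0
min.dist <- nearest_protein_distance(water.xyz, protein.xyz)

# Assumption: waters at or below the cutoff distance are retained
retained <- min.dist <= cutoff.prot.h2o.dist

sum(retained)   # waters kept
sum(!retained)  # "distant" waters removed
```

The brute-force distance calculation is adequate for a few hundred waters per structure; a neighbor search would be the usual choice for much larger systems.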

The information returned from cleaning the protein structures is:

Cleaning Summary
PDBid  removedHydrogens  num.o.OoR  num.b.OoR  num.Modeled  num.notModeled  num.WatersDistantRemoved  num.WatersRetained
1abi   FALSE             0          0          0            2703            5                         241
1abj   FALSE             0          0          0            2531            4                         192
1hah   FALSE             0          0          63           2590            0                         204
1hai   FALSE             0          0          54           2524            0                         194
1ihs   FALSE             0          0          0            2561            0                         146
1ppb   FALSE             0          48         0            2771            69                        333
1thr   FALSE             0          0          15           2526            0                         190
1ths   FALSE             0          0          0            2529            0                         140
1tmb   FALSE             0          0          0            2644            10                        229
1tmt   FALSE             0          0          0            2526            0                         111

The cleaning.summary table shows there were no thrombin structures with occupancy values outside the normal range of 0 to 1, but a single structure, 1ppb, had 48 atoms assigned B-values outside the normal range of 0 to 100. These 48 atoms were removed from 1ppb. Three structures (1hah, 1hai, and 1thr) had modeled atoms, meaning atoms with occupancy values of 0.01 or less, and these atoms were also removed. Four structures (1abi, 1abj, 1ppb, and 1tmb) had water oxygen atoms beyond 6 Angstroms from a protein atom, and these “distant” waters were removed.
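
The atom evaluations summarized above can be illustrated with a small base R sketch. The data frame below is invented for illustration and uses the column names b (B-value) and o (occupancy) found in bio3d atom tables; the thresholds come from the text, but the code is a hedged sketch and not the implementation inside CleanProteinStructures.

```r
# Invented atoms: b = B-value, o = occupancy
atoms <- data.frame(
  b = c(12.5, 38.0, 104.2, 55.1, 20.0, 99.9),
  o = c(1.00, 1.00, 1.00, 0.01, 0.50, 1.00)
)

# Out-of-range checks using the normal ranges given in the text
o.out.of.range <- atoms$o < 0 | atoms$o > 1
b.out.of.range <- atoms$b < 0 | atoms$b > 100

# Modeled atoms: occupancy of 0.01 or less (but still within range)
modeled <- !o.out.of.range & atoms$o <= 0.01

# Counts named after the matching cleaning summary columns
summary.counts <- c(num.o.OoR = sum(o.out.of.range),
                    num.b.OoR = sum(b.out.of.range),
                    num.Modeled = sum(modeled))
print(summary.counts)

# Atoms that survive all three evaluations
cleaned.atoms <- atoms[!(o.out.of.range | b.out.of.range | modeled), ]
nrow(cleaned.atoms)
```

In this toy example one atom has a B-value above 100 and one atom is flagged as modeled, so four of the six atoms remain.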

The B-value barplots, generated before the structures are cleaned, illustrate the difference in structure quality based on B-values. Three structures (PDBids: 1ihs, 1ppb, and 1tmt) in the thrombin dataset have atoms with B-values of 65 or greater. Atoms with B-values greater than 60 are considered lower quality because of their greater variance.

Normalizing the B-values provides a way to compare B-values across a collection of protein structures; Z-score values less than 0 indicate atoms with B-values less than the mean B-value for a structure, while Z-score values greater than 0 indicate atoms with B-values greater than the mean B-value. Within vanddraabe, water atoms with normalized B-values greater than 1.0 are removed from analysis. The inclusion of all atoms within the protein structures indicates the overall quality of the atoms within the structure. Normalized B-values are calculated for protein, non-protein (ligands), and water atoms separately during the evaluation of atoms prior to determining conserved waters.
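
As a rough illustration of the normalization described above, the sketch below computes Z-scores separately for each atom class in a single toy structure and flags waters with a normalized B-value greater than 1.0. The data are invented, and the use of the sample standard deviation (R's sd) is an assumption; the calculation inside vanddraabe may differ in detail.

```r
# Invented B-values for one structure, labeled by atom class
atoms <- data.frame(
  type = c(rep("protein", 4), rep("water", 4)),
  b = c(10, 20, 30, 40, 15, 25, 35, 65),
  stringsAsFactors = FALSE
)

# Z-score within each atom class: (b - mean) / sd
# Assumption: sd() is the sample standard deviation
atoms$b.norm <- ave(atoms$b, atoms$type,
                    FUN = function(b) (b - mean(b)) / sd(b))

# Only waters are filtered on the normalized B-value cutoff of 1.0
drop.water <- atoms$type == "water" & atoms$b.norm > 1.0
retained <- atoms[!drop.water, ]

print(atoms)
nrow(retained)
```

Here the water with a B-value of 65 has a Z-score of about 1.39 and is dropped, while the protein atom with a Z-score above 1.0 is kept because the cutoff applies only to waters.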