August 2010 technical highlight
Servers and databases
Tracking the progress of all the PSI targets requires sophisticated databases.
The Structural Genomics Knowledgebase manages PSI data, giving an overall view of the PSI's progress.
Over the past ten years, the Protein Structure Initiative (PSI) has amassed a considerable amount of data and information. Here, we review the databases and servers that have played an essential role in organizing and analyzing these data, as well as in coordinating the activities of the 12 PSI-2 network centers.
The Structural Genomics Knowledgebase (SGKB) 1 collects all of the target, structural, modeling, and technological results produced by the PSI, integrates them with additional biological information from other key databases, and disseminates them to the public to support further research.
The SGKB manages PSI data through a series of integrated modules so that comprehensive information can easily be found in one place. For target management, more than 40,000 non-redundant sequences are in the PSI pipeline, spread over several centers. Each center collects data and monitors progress using various laboratory information management systems (LIMS) and deposits these data in two central PSI information-tracking databases within the SGKB. TargetDB 2 is a target registration database that provides data on the status of targets selected for structure determination and on their experimental progress at every stage. PepcDB takes TargetDB one step further by including experimental information for each of the target sequences. Together, they give an overall view of the PSI's progress and provide avenues for everyone to learn from the PSI's experimental experiences.
Once the three-dimensional (3D) structures of the proteins are solved and deposited in the Protein Data Bank, comparative (homology) models derived from evolutionarily or structurally related proteins can provide reliable models for sequences of unknown structure when there is significant similarity. Many methods exist for generating such models, but it has been difficult to access all available models because of the different software formats and accession codes in use. The Protein Model Portal, 3 also part of the SGKB, addresses this by giving access to all the models leveraged from PSI experimental targets. At the time of writing, the portal contains 13.5 million model structures provided by six PSI centers plus MODBASE 4 and the SWISS-MODEL Repository. 5
In addition to the results themselves, over 150 servers, databases, and technologies have been developed by the PSI centers for all steps of the structural biology pipeline. Reports and access to these offerings can be found in the PSI Technology Portal, while publications about all of the PSI's results can be found through the PSI Publications Portal.
The individual centers have also developed their own LIMS, either by adapting well-known software such as the Sesame LIMS 6 or by developing new systems specific to the tasks of structural genomics, such as the SPINE (Structure Proteomics in the NorthEast) LIMS. 7, 8 These systems help organize and direct activities at the laboratory benchtop, archive the resulting data, share the data among various components of the individual centers, and output the data in a format suitable for PepcDB and TargetDB. The graphical visualization software GraphViz has been helpful for viewing data that are spread across several linked databases. 9
The data gathered since the beginning of the PSI are used to guide all stages of the protein production pipeline. For example, information from TargetDB has been mined and used to develop the XtalPred Server, 10 the Protein Crystal Structure Propensity Prediction Server and protein crystallizability predictor. 11 These servers can help predict whether a construct will readily crystallize by taking into account biophysical characteristics such as length, hydrophobicity or predicted disorder. Disorder prediction, which is particularly useful for construct design and optimization, can also be performed using the web-based tool DisMeta, which uses 14 different disorder-prediction algorithms and six sequence-based structure-prediction programs.
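As a rough illustration of the kind of sequence-derived features mentioned above, the sketch below computes the length and the mean hydropathy of a protein sequence. It is not the method used by XtalPred or any other PSI server: the choice of the Kyte-Doolittle hydropathy scale and the function name `sequence_features` are assumptions made for this example only.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
# Using this particular scale is an assumption made for illustration.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def sequence_features(sequence: str) -> dict:
    """Return the length and mean hydropathy of a protein sequence.

    Hypothetical helper: a real crystallizability predictor combines
    many more features and is trained on experimental outcomes.
    """
    seq = sequence.strip().upper()
    if not seq:
        raise ValueError("empty sequence")
    unknown = set(seq) - KYTE_DOOLITTLE.keys()
    if unknown:
        raise ValueError(f"non-standard residues: {sorted(unknown)}")
    mean_hydropathy = sum(KYTE_DOOLITTLE[aa] for aa in seq) / len(seq)
    return {"length": len(seq), "mean_hydropathy": mean_hydropathy}


if __name__ == "__main__":
    print(sequence_features("MKTAYIAKQR"))
```

Features like these would only be inputs to a predictor; deciding how they relate to crystallization success requires mining experimental records such as those in TargetDB.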
Each structural genomics center has developed tools for high-throughput primer design. Primer Prim'er is a web-based tool that can be used to design primers for commonly used expression vectors or user-defined ones. 12, 13 It provides an extensive graphical interface that presents the user with information useful in construct design, and it is particularly suited to high-throughput biology because it can calculate primers for multiple targets and provide primer sets in a 96-well plate format.
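To make the 96-well plate format concrete, the following sketch assigns a list of targets to wells A1 through H12 and starts a new plate after every 96 entries. The row-by-row fill order and the function name `layout_plates` are assumptions for illustration; they do not describe the actual output of Primer Prim'er.

```python
ROWS = "ABCDEFGH"          # 8 rows on a standard 96-well plate
COLUMNS = range(1, 13)     # 12 columns
WELLS = [f"{row}{col}" for row in ROWS for col in COLUMNS]


def layout_plates(items):
    """Map items onto 96-well plates, filling each plate row by row.

    Returns a list of dicts, one per plate, keyed by well name.
    The fill order is an assumed convention, not a standard.
    """
    items = list(items)
    plates = []
    for start in range(0, len(items), len(WELLS)):
        batch = items[start:start + len(WELLS)]
        plates.append(dict(zip(WELLS, batch)))
    return plates


if __name__ == "__main__":
    # Hypothetical target identifiers.
    targets = [f"target_{i:03d}" for i in range(100)]
    for number, plate in enumerate(layout_plates(targets), start=1):
        print(f"plate {number}: {len(plate)} wells used")
```

In practice each well would hold a forward and reverse primer pair for one target, so the mapped value could be a tuple of two sequences instead of a single identifier.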
Producing 3D protein structures is only part of the PSI's remit. Facilitating the deduction of function from structure has increasingly become a feature of PSI-2, and several servers and databases have been developed to aid this. One example is The Open Protein Annotation Network (TOPSAN), which resembles Wikipedia in some respects and invites the biological community to aid in the functional annotation of PSI structures.
Bioinformatics can also help with predicting function. The PSI has helped develop some of these methods while extensively testing many others that have been produced in the community. Thus, many different function-prediction methods exist, and each algorithm is often designed to detect particular features. Function-prediction software often relies on accurate protein structure comparison. FATCAT, a flexible protein structure alignment algorithm, offers a way to perform similarity searches for a given protein structure against a database and, at the same time, to identify structural rearrangements between homologous proteins. 14 A newly developed TOPS++FATCAT algorithm uses a simplified description of protein structures to speed up the search for structural neighbors, which can now be performed almost interactively. 15 FATCAT has recently been included in the standard RCSB PDB tool set, and the structural neighbors calculated by FATCAT can be found for every protein structure on the RCSB PDB site.
Another valuable web-based functional annotation tool developed in PSI-2 is the MarkUs Functional Annotation Server. 16 MarkUs identifies related protein structures and sequences, detects protein cavities, and calculates the surface electrostatic potentials and amino-acid conservation profile. It can map onto the query structure the locations of ligands observed in structurally similar proteins, providing valuable functional insights. MarkUs is particularly useful as an interactive tool for function discovery from the 3D structures of proteins, and it provides an extensive set of bioinformatics tools for functional annotation. A collection of PSI structures annotated with MarkUs is also available online.
Analysis of protein surfaces can reveal conserved functional features of proteins. The Global Protein Surface Survey (GPSS) is a library of annotated surfaces derived from structures in the PDB for studying evolutionary relationships and uncovering novel similarities between proteins. 17 The surface analysis can identify functionally homologous surfaces and predict protein function and ligand binding. Similarly, the Procognate database 18 contains an assignment of PDB ligands to the domains of structures as classified by the CATH, SCOP and Pfam databases.
Even so, assessing function is not an easy task, owing to the lack of experimental data for many protein families. GeMMA (Genome Modelling and Model Annotation), a functional subfamily classification protocol, has been used by one PSI center to group its protein structures according to functional families. GeMMA uses two methods for identifying putative function: pattern recognition, which classifies proteins according to locally conserved sequences, and clustering of sequences on the basis of their similarities. In addition, GeMMA can be trained on annotated families to establish suitable thresholds. 19
Another approach is to use many methods and see whether a consensus is possible. The ProFunc web server allows users to submit the coordinates of a structure and run a query against multiple function-prediction programs to identify motifs and close relationships with functionally characterized proteins. 20 Sequence and structure data are combined in Gene3D, a bioinformatics resource developed by the PSI MCSG that is available to the biology community. 21, 22
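The idea of looking for agreement among several methods can be sketched as a simple majority vote. This is only an illustration of consensus in general, not a description of how ProFunc combines its results; the method names, the function labels and the `min_fraction` threshold are all hypothetical.

```python
from collections import Counter


def consensus_function(predictions, min_fraction=0.5):
    """Return the label predicted by more than min_fraction of methods.

    predictions maps a method name to its predicted function label,
    or to None when the method made no call. Returns None when no
    label reaches the required share of the votes cast.
    """
    votes = Counter(
        label for label in predictions.values() if label is not None
    )
    if not votes:
        return None
    label, count = votes.most_common(1)[0]
    total = sum(votes.values())
    return label if count / total > min_fraction else None


if __name__ == "__main__":
    # Hypothetical outputs from four function-prediction methods.
    calls = {
        "method_a": "kinase",
        "method_b": "kinase",
        "method_c": "hydrolase",
        "method_d": None,
    }
    print(consensus_function(calls))
```

A plain vote treats every method as equally reliable; weighting methods by how well they have performed on proteins of known function would be a natural refinement.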
Once a structure has been solved, it is important to assess its quality. The Protein Structure Validation Software suite (PSVS) brings together several existing quality-evaluation tools, such as PROCHECK, MolProbity, Verify3D, ProsaII, the RCSB PDB validation software and several PSI-specific ones. It provides overall and site-specific indicators of quality, and global scores are presented as Z scores. PSVS analysis indicates that structures from structural genomics projects have quality scores that are, on average, slightly better than those of structures produced by traditional structural biology projects over the past 10 years. NMR structures can also be validated by assessing 'goodness of fit' to the experimental NMR data. One quick and accurate method is to calculate Recall, Precision and F-measure from NOESY spectra. 23 This requires neither nuclear Overhauser effect (NOE) assignments nor complete relaxation matrix calculations, which speeds up the process.
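The statistics named above follow standard definitions, which the sketch below applies to two sets of peak identifiers. It is a simplified illustration under stated assumptions: peaks are treated as exactly matching labels, whereas real NOESY peak matching involves chemical-shift tolerances, and the published method includes details not reproduced here. The function names and the reference scores passed to `z_score` are hypothetical.

```python
from statistics import mean, stdev


def recall_precision_f(observed, expected):
    """Compare observed peaks with peaks expected from a structure.

    recall    = matched / observed peaks
    precision = matched / expected peaks
    F-measure = harmonic mean of recall and precision
    Peaks are compared as exact identifiers (a simplification).
    """
    observed, expected = set(observed), set(expected)
    matched = len(observed & expected)
    recall = matched / len(observed) if observed else 0.0
    precision = matched / len(expected) if expected else 0.0
    if recall + precision == 0:
        return recall, precision, 0.0
    f_measure = 2 * recall * precision / (recall + precision)
    return recall, precision, f_measure


def z_score(value, reference_scores):
    """Express a raw score in standard deviations from a reference mean.

    The reference set is an assumption of this sketch; it needs at
    least two values with non-zero spread.
    """
    return (value - mean(reference_scores)) / stdev(reference_scores)


if __name__ == "__main__":
    observed_peaks = {"p1", "p2", "p3", "p4"}
    expected_peaks = {"p3", "p4", "p5"}
    print(recall_precision_f(observed_peaks, expected_peaks))
    print(z_score(3.0, [1.0, 2.0, 3.0]))
```

Reporting a Z score rather than a raw value makes scores from different validation tools comparable, since each is expressed relative to the spread of its own reference set.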
Many of the structures solved by the PSI are parts of biological networks. Multiple databases display protein-protein interaction data on the web, but none include structural biology results. The Human Cancer Pathway Protein Interaction Network (HCPIN) provides this information for several cancer-associated signaling pathways, along with homology models for proteins whose structures have not yet been solved. 24 Other examples include structure-function annotation galleries developed by the PSI project for protein pathways associated with FeS cluster assembly.
This review only covers some of the servers and databases that have been developed by the PSI. To find more tools relevant to your research, visit the PSI Technology Portal and explore by text or by experimental step in the pipeline.