This is a part of our little project, we performed during the course of Laboratory of Bioinformatics.

To begin with, I want to thank all of my groupmates, who helped me with it, since I am not that good with biology, as they are. And, secondly, if you are doing homology modelling, you should really focus on following links at the uniprot and read a lot about all the ligands and residues, protein family and it’s functions, following all the articles provided, than, modelling itself, won’t make a deal. Cheers.

Abstract

Performing this project we are going to do a homology modelling of protein structure. Homology modelling is the technology used to build an atomic-resolution model of the target sequence based on the known structure of the related homologous protein. It is possible, because proteins with the more than 20% sequence identity usually have the same three-dimensional structure, which depends on the type of the residues. We are going to find homologous protein sequences, perform a structural alignment to select the common sequence parts and then we will use our template homologous protein to build a model of the target sequence using Modelleler software. After that we are going to analyze quality of the obtained model and try to interpret the results.

Introduction

Proteins are the basic bricks for building the Life. Not actually the very basic bricks, nucleic acids, encoding the information about proteins are, but the proteins do all the real job. There are several types how the proteins are used. They perform structural, catalytic, promoting functions. You muscles consist of them. Pupil in your eye is the crystallized form of the optic protein. Bunch of proteins used to control your emotional reactions and behavior, they are responsible for transmitting signals to the human cell to start division, attacking other cells or even dying. They can help some ions, required by the cell to survive to get through the cell membrane. So they are the worker bees who do all the job.

So that is why so important to know, what each protein, coded in the nucleotide chain does. For example, if you have your exome (meaningful part of the DNA code) sequenced, and you some differences in it from the other people, you would really like to know, what these differences do, wouldn’t you? They can be responsible for blocking some processes to spread, or vise verse, prevent cells from dying, provoking possibility of developing a cancer. But how people are able to determine, what each protein does?

Luckily, function of the protein depends on it 3D structure, because having 3D structure we can predict, which regions of this really complex molecule are responsible for acquiring a Zinc atom, for example. So it is really import to find out the 3D structure of the molecule. There are several ways to do that, including just computation of it’s structure using such famous project as FOLDING@HOME, but they are really computationally expensive. The more naive way to do it is to perform a X-ray diffraction or NMR spectroscopy. These two ways are really expensive and it takes a lot of time to find out the 3D structure, using them [1]. However, great demand for knowing the protein structure resulted in some really smart ways to do this job done.

And now we are talking about homology, or comparative modelling. Which is the easiest way to model protein structure in silico (computationally). It is based on two major facts: the structure of a protein, as we have said before, is uniquely determined by its amino acid sequence [2]. Knowing the sequence should suffice to obtain the structure. And the fact that during evolution the structure of protein changes slowly than the associated sequence, being for sustainable. So similar sequences adopt practically identical structures and even not that much related sequences still are able to fold into similar structures [3]. It was shown by Rost in 1999, two sequences are practically guaranteed to adopt a similar structure if they have certain percentage of identical residues

Figure 1. The two zones of sequence alignments. Two sequences are practically guaranteed to fold into the same structure if their length and percentage sequence identity fall into the region marked as ‘‘safe.’’ An example of two sequences with 150 amino acids, 50% of which are identical, is shown (gray cross). [4]

So, if want to find out the structure of the target sequence, we compare it to sequences of known structures, stored in the PDB [5], with the using of UniProt and BLAST (Basic Local Alignment Search Tool) search and find a sequence, containing a region, which matches our target sequence with, for instance, 60% identical residues. Since it is rather good alignment result, we can use found sequence as the template for modelling structure of the target sequence. We just need to cut the matched region out of the template sequence, work with the amino acids, that differ between target and template sequences and then, apply corresponding tools.

This tools will arrange the backbone identically to the selected template. This means that not only position of alpha carbons, but also the phi and psi angles and secondary structure, are made identical to the template. Next, we can use some more tools to adjust sidechain positions to minimize collisions, and to try to do energy minimization to improve the model.

The quality of the sequence alignment is really important, because misplaces indels (gaps, caused by insertion or deletion) will make residues to be adjusted in incorrect space. So after performing automated alignment it is strictly recommended to analyze results manually [6]. All the steps of the research are thoroughly describe below.

Research

During this project, we want to obtain an annotation of four protein sequences using homology modelling. These four addressed sequence are obtained from Pleurotus eryngii, Vitis vinifera, Bos taurus and Arabidopsis thaliana. Below is the description of the methods we are going to use and the suggested workflow.

To begin with, we want to find out if the target sequence had been already annotated, and if it had not been annotated yet, we want to find out more about the name of the protein, the organism, it had been obtained from, and different characteristics of the sequence, such as its structure , function, ligands and the so forth. The best choice for this job is one of the biggest databases for protein sequences – UniProtKB [7]. Not only it provides the required information, but also supplies links for PDB entries for the related sequences.

The UniProt provides the ability to perform a BLAST search, which compares the target sequence with the proteins in the selected database. For this research we are running comparison against the PDB, since every structure of the PDB has the .pdb file with the molecular structure, containing information about the 3D coordinates of each atom, which can be used as the template for the modelling. After performing the search, we receive the list of the similar structures with a different parameters. These several parameters are considered to choose the best template for the modelling:

The E-Value (Expectation value ), which is responsible for the significance of the alignment, the lower it is, the better is the found structure.
The length of the alignment, which is really import, since we want template sequence to contain the functional core of the protein.
The alignment identity score which is the percentage of identical amino acids over the template (aligned) sequence . We will address Fig. 1 to decide, whether this identity score is good, or not in each case.

So, we are able to choose the template for the modeling, but at first we do need to perform a global alignment procedure between the target sequence and the found template. We can do this by using ClustalW2 [8] or LALIGN [9]. For this research, we chose the ClustalW2, aligning with the Blosum matrix weights. One of the benefits of using the ClustalW2 service, is that we can obtain the result in the PIR format.

Which, in its turn, can be used as the input file for the Modeller software tool [10]. It is used for homology modeling of a protein 3D structure. For the purporses of this research, the most fresh version 9.12 was used. Modeller is capable of modeling all non-hydrogen atoms of the target sequence, provided the template for it. This software is also capable of calculating the position of the absent in the template structure atoms, by optimizing the value of the objective function, with the respect to electronic density formula, ligands and loops. The Modeller computes several models (10, in our case, which is set in the model.py option file), from which we choose the one with the lowest objective function score. After this, only step of the model validation remains.

For model evaluation and validation, we use the PROCHECK software tool (v.3.4.3) [11]. It returns a set of plots, describing the quality of the acquired model, from which we are mostly interested in the Ramachandran plot, which shows the percentage of residues in the allowed region of the graph. The good model is supposed to have not less than 90% of them in the right place. Also, for the purpose of the quality control of the obtained model, jCE (Combinatorial Extension) tool [12] is used, to calculate the RMSD (Root Mean Square Deviation) value between two sequences (target and the template one. The lower it is, the more similar resulting models are, and the better quality of the model can be stated.

In the end, we address the SwissMod database, to compare modelled structures to the available precomputed models and make conclusions about their similarities or differences.

So, the short version of the suggested workflow is as follows:

Use the BLAST search on the UniProt website to find a suitable similar sequence and get its .pdb file.
Then, use ClustalW2 to perform a global alignment between target and a template and to get the PIR file.
Run a Modeller script to model a 3D structure of a protein, based on the selected template.
And, finally, perform a validation of the obtained model with the use of Procheck and jCE. If the results does not seem good enough, go to step

First protein. 3D modelling of the Laccase 4 from Pleurotus eryngii

Organism: Pleurotus eryngii (Boletus of the steppes) (Funghi)

Enzyme that catalyzes the oxidation of one compound with the reduction of another

Molecular function: Oxidoreductase activity (copper ion binding)

Length: 533 residues

UniProt entry: B0JDP9

This protein is kept in the TrEMBL part of the UniProt, which is automatically annotated. This protein plays regulatory role in biochemical processes. It is a copper ion binding protein, with oxidoreductase activity. It consists of the signal peptide (1 to 20 AAs), followed by the Laccase-2 chain (21-519), which is responsible for binding of the coppers. [13]

Figure 2.1: Key sites for the UniProt entry B0JDP9

Following the suggested above approach, at first, we perform a BLAST search for the sequence with the known structure, to use it as the template. As shown as the Figure 2.2 , we have a lot of templates with high identity score. Since the sequence ID: Q12718 (Laccase-2 of Trameter versicolor) with the highest identity score (62%), E-value which equals to 0, belongs to fungus as our target, and is kept in the Swiss-Prot part of the UniProt, which is manually annotated, we can call it the best choice.

Figure 2.2: Results of the BLAST search for the UniProt B0JDP9.

The protein with the UniProt ID Q12718 has the associated PDB entry 1GYC, which was obtained by using X-ray crystallography with the resolution of 1.9 Å . The length is 519 AA.

After running structural alignment at ClustalW, we see two parts of the target sequence, which are covered by the template. First 20 amino acids belong to signal peptide, which abcense won’t interfere with protein function [13]. Neither is the C-terminal part of the target sequence is aligned with the template, so it will be cut out in the pir file.

Figure 2.3: ClustalW2 multiple sequence alignment for target sequence and 1GYC template

After a Modeller run, the lowest objective function score was associated with the 4th model: 3148.9258.

The image of the modelled protein structure is shown in the figure 2.4.

Figure 2.4: Model built with 1GYC template with copper ions

Figure 2.5: Ramachandran Plot of the 3D model. 91.2% of residues are in allowed regions.

Now we have to perform a validation step. As the result of the PROCHECK work, we give the Ramachandran Plot, which states that 91.2% of residues are in allowed regions, which allows us to state high quality of the model.

The next step, as was discussed above, is the superimposition of the model and the template, done by the JCE. The resulting RMSD is 0.24, Z-score is 8.3, Sequence Identity is 62% and the Sequence similarity equals to 72%. These results mean that we have a good model, difference between Sequence Identity and the Sequence similarity is explained by the fact that the protein structure is more conserved, than the sequence.

Figure 2.6: Result of the jCE superimposition of the 1GYC template and our target.99990004 sequence.

Figure 2.7: jCE superimposition

Finally, we need to compare our model to the one which can be automatically generated by the Swiss Model. As we can see from Fig.2.8, this model has only 3 ions modelled and has 61.74% of sequence identity score. Now we can state that we have acquired model of better quality, than the model, which can be automatically generated.

Figure 2.8: Automatically generated model for target sequence and 1MHM template.

Conclusion

This article shows on four examples, how building protein model by homology works. It is more cheap than obtaining the model by X-ray crystallography, or using the NMR method, and more effective than automatic homology modelling and annotating, which is made by Swiss-model.

Following suggested scheme of performing a BLAST search for template, choosing the best template, aligning the sequences, modeling the target sequence structure based on the aligned template and performing the validation, we do really want to focus our attention on the template selection. It is the most important part of the process, since we need to understand, which parts of the sequence will remain after protein expression, which domains are responsible for the protein function and must be present in the template, and which part of the target sequence can be omitted, if it, for example, contains the information about the signal peptide.

Performing this project resulted in understanding that this key part, which can never be done automatically, is the most important for the bioinformatician, because it requires understanding of the key functions of the sites, or the persistence to follow linked articles at the UniProt entry to find out more about key regions and its structures.

Bibliography

1. Goodsell, D. S., Burley, S. K. & Berman, H. M. Revealing structural views of biology. Biopolymers 99, 817–824 (2013).

2. Epstain, Goldberger, and Anfinsen, 1963

3. Hopf, T. A. et al. (2012): “Three-dimensional structures of membrane proteins from genomic sequencing” In: “Cell” 149, 1607-162

4. HOMOLOGY MODELING , Elmar Krieger, Sander B. Nabuurs, and Gert Vriend

5. Protein Data Bank, Jan 28, 2014, there are 96980 Structures

6. C. Chothia, A.M. Lesk. The relation between the divergence of sequence and structure in proteins. EMBO J. 5:823-826 (1986).

7. UniProt: the Universal Protein knowledgebase , http://uniprot.org

8. http://www.ebi.ac.uk/Tools/msa/clustalw2/

9. http://embnet.vital-it.ch/software/LALIGN_form.html

10. A. Fiser, R.K. Do, & A. Sali. Modeling of loops in protein structures, Protein Science 9. 1753-1773, 2000.

11. Laskowski, R., Rullmann, J. A., MacArthur, M., Kaptein, R. & Thornton, J. AQUA and PROCHECK-NMR: Programs for checking the quality of protein structures solved by NMR. J. Biomol. NMR 8, 477–486 (1996)

12. Shindyalov IN, Bourne PE (1998) Protein structure alignment by incremental combinatorial extension (CE) of the optimal path. Protein Engineering 11(9) 739-747

13. Isolation, molecular characterization and expression of ERY4 laccase gene of Pleurotus eryngii. Bleve G., Grieco F. Submitted (JUL-2007) to the EMBL/GenBank/DDBJ databases

14. http://www.uniprot.org/uniprot/Q7XZQ9

15. AtPng1p. The first plant transglutaminase. Della-Mea M., Caparros-Ruiz D., Claparols I., Serafini-Fracassini D., Rigau J. Plant Physiol. 135:2046-2054(2004)

16. Structural analysis of Arabidopsis thaliana chromosome 5. X. Sequence features of the regions of 3,076,755 bp covered by sixty P1 and TAC clones. Sato S., Nakamura Y., Kaneko T., Katoh T., Asamizu E., Kotani H., Tabata S. DNA Res. 7:31-63(2000)

17. http://www.ebi.ac.uk/thornton-srv/databases/cgi-bin/enzymes/GetPage.pl?ec_number=2.3.2.13

18. http://scop.mrc-lmb.cam.ac.uk/scop/data/scop.b.e.d.b.f.c.html

Value generation

Computing threedimensional structure of a protein by homology modelling. Bioinformatics tutorial.