Building HMM profile for the protein detection

Abstract

Background

Proteins are generally composed of one or more functional regions, which are called domains. These regions are independent folding units and can evolve, function, and exist independently of the rest of the protein chain.

Motivation

In this paper we describe using HMM to model the Kunitz domain. This model will be later validated and compared with the automatically curated Pfam database to understand benefits of manual model building.

Results

Regarding the comparison with the Pfam database model of a relatively good quality was built with the sensitivity of 0.994, specificity of 1, 0.927 precision and accuracy of almost 1.

Introduction

We want use modern techniques of machine learning to build HMM model that is able to identify the Kunitz-type inhibitor domain. Since domains are conserved structures and they are responsible for performing certain functions, this is crucially important for understanding protein function and performing protein annotation.

We focus on the task of building a model capable of identification of he BPTI/Kunitz-type inhibitor family, to which our target protein with PDB code 3OFW (UniProt code P31713) belongs. The crystal structure was obtained by x-ray diffraction with the resolution of 2.5 Å.

Figure 1: General 3D backbone structure of the 3OFW protein (from PDB).

It consists of 55 amino acids and was manually annotated, because it lies in the SwissProt part of the database. It is Kunitz/Bovine pancreatic trypsin inhibitor domain which displays activity not only against serine-, but also against cysteine-, and aspartate proteases [1].

According to PDB it belongs to the protein family with the PFAM accession number PF00014. Among other proteins containing the same domain and thus belonging to the same family can be named aprotinin (bovine pancreatic trypsin inhibitor, BPTI), Alzheimer’s amyloid precursor protein (APP), and tissue factor pathway inhibitor (TFPI). The domain itself has alpha-beta fold containing 6 precisely positioned cysteins involved in the formation of disulfide bonds. The φ angles of the residues from 17 to 30 are considered one of the most stable characteristics of this family [2].

We are going to describe the basic principles of Hidden Markov Models, which will make as able to recognize Kunitz-type domains in the sequences after building and validating the model.

Hidden Markov Models

Hidden Markov Models (or HMM shortly) is one of the models, widely used in machine learning.

Hidden Markov models are especially known for their application in temporal pattern recognition such as speech, handwriting, gesture recognition, part-of-speech tagging, musical score following, partial discharges and, of course, bioinformatics.

In the general case HMM is considered a part of the urn scheme. We can describe it as a set of several “urns”, called “states”. On each step we can choose this state independently from the history of previously chosen states. The probability of choosing the state is called state transition probability. Besides it, every state has the emission or output probability of giving a particular result. It is like taking a labeled balls out of several urns. The chance of choosing between urns will be called “transitional probability” and the probability of taking any numbered ball from an urn will be emission probability.

When the observer doesn’t know the states and the only parameters available for evaluation are the emitted states and transitional possibilities, the model is considered “hidden”. Different sequences of the obtained results can be evaluated with the help of HMMs to understand and probabilistic model hidden behind the process. So the observer can understand possible states and other model parameters.In general, this type of problem (i.e. finding the most likely explanation for an observation sequence) can be solved efficiently using the Viterbi algorithm [3].

Figure 2: General scheme of the Hidden Markov Process

HMM was one of the first machine learning approaches, applied in the bioinformatics. The DNA sequence was viewed as a stochastic process with local compositional properties determined by the states of a hidden Markov chain. [4].

Methods

Proteins from the same family share evolutional history and often share the same function. So understanding at least the protein family helps to determine its functionality [5]. With the help of HMMs the problem of determining the family becomes the question of building the model able to recognize the family of a protein by its sequence.

This problem can be solved using the following approach:

Selecting the training set for the model (assumes building multiple sequence alignment)
Training the model with a HMMER software.
Evaluating the performance of the obtained model with the search of the HMM profile in the protein sequence database.
Calculating the thresholds for uniting proteins in a family.

PDBeFold. Sequence alignment

Figure 3: Input form of the PDBeFold

We need a number of sequences, belonging to the Kunitz domain to train our model. Our target protein domain is the PDB entry 3OFW. Knowing PDB entry, we can use a tool called PDBeFold, which compares protein 3D structures. It’s based on the identification of residues occupying equivalent geometrical positions [6].

Figure 4: Results of the PDBeFold structural alignment

It is capable both of pairwise and multiple 3D alignment and provides different metrics for the evaluation of the results. These metrics are briefly described in table 1.

Selection of the thresholds for these metrics is a tricky task. The common problem for the machine learning methods is the overfitting — when the algorithm adopts to the training set and doesn’t learn to recognize the less similar patterns which can still produce the desired output. So we focus on the sequences with high structural similarity, paying less attention to the sequence similarity.
Table 1: Explanation of the sequence alignment metrics

The following thresholds were selected: RMSD lower or equal to 1.5 Å , Z-score greater than 6. That automatically selects sequences Q-scoring greater than 7. We were able to find 249 sequences with the satisfying parameters.

This is done with the help of the very simple bash script:

cat reslist.dat | tr ‘:’ ‘\t’ | awk ‘{if ($4 > 6) print $0}’ | awk ‘{if ($5 <=1.5) print $0 }’ | awk ‘{if ($9 > 0.3) print $0}’|awk ‘{print $2,$19,$20}’ > filtered_results.dat

Data wrangling

The obtained list of 249 sequences contained duplicates with different values of Q-score, so the removing of the redundant sequences was performed with common command line linux tools such as awk and a custom python script. Script helped to choose the sequences with highest Q-score between chains.

These thresholds after removing duplicates with lower Q-score provided us 129 sequences for which we run PDBeFold multiple alignment.

All the useful scripts are included into separate archive.

PDBeFold multiple alignment

We used 129 sequences in the format of the sequences name and the corresponding chain as the input for PDBeFold with the multiple option. Output of the multiple alignment is the fasta-coded file containing sequence, which is almost ready for using as the input to HMMER. We only had to transform them to the Stockholm format with the Jalview software.

Using HMMER for building a model

We used HMMER 3.0 software for building an HMM model with the list of the aligned sequences in the Stockholm format. The launch of the program happens with the following command:

$hmmbuild output.hmm alignment.stockholm

Which builds the model and saves it in the output.hmm file. HMMER uses set of the pipeline organized filters to build a resulting model and provides the results with the evaluated parameters [7].

The first section of the HMMer output contains information about the sequences. The most important parameter is the E-value, which is based on the sequence bit score — the log-odds score for the complete sequence. This parameter doesn’t depend on the size of the database and the lower it its, the more significant is the result.

Figure 5: First section of the HMMER 3.0 output

Figure 6: “internal pipeline statistics summary” HMMER3 output section

Model validation

After using the awk script, we obtained the predicted Kunitz-family proteins as the list of sequences. Then we addressed the Pfam database to download list of manually annotated Kunitz-family proteins to validate the results of our model.

In our case we questioned the hypothesis of sample belonging to the Kunitz family, so for the purposes of model validation we have to calculate number of correctly and incorrectly identified samples. In general, positive = identified correctrly and negative = rejected hypothesis. Therefore:

True positive = correctly identified
False positive = incorrectly identified
True negative = correctly rejected
False negative = incorrectly rejected

We used common statistics, which are explained on the picture 7. The four outcomes can be formulated in a 2×2 confusion matrix

Figure 7: Common statistics for the classification problems

For purpose of validation we used a short python script, which calculated these statistics, described in the table 2. Output of the script is displayed on the picture 8.

Name	Description	Value
Sensitivity or True Positive Rate (TPR)	Number of TP over the total of true positives	0.994
Specificity (Sp)	Number of TN over the total of true negatives	0.999954
False Positive Rate (FPR)	Number of FP over the total of true negatives	0.000046
Precision or Positive Predicted Value(PPV)	Number of TP over the total of predicted as positive	0.927
Accuracy (ACC)	Proportion of true results in the population	0.999950
Matthews Correlation Coefficient (MCC)	Describes the confusion matrix with a result of 1 represents a perfect prediction, 0 no better than random prediction and -1 indicates total completely wrong case	0.960

Table 2: Model validation results

Figure 8: Output of the validation script

Results

The downloaded file with the Swiss-prot entries contained 544996 annotated proteins, from which 317 were predicted correctly to be from Kunitz family (True Positive), 25 of them were considered belonging to family, but in fact they were not (False Positive), 544652 were recognized as True Negative proteins, not belonging to Kunitz domain and 2 proteins were not recognized by the model (False Negative).

These proteins are P17726 and P0DJ83. They scored relatively low metrics to be included in the threshold by the HMMer.

Figure 9: Inclusion threshold for the HMMER

High False Positive / False Negative shows that built model tends to include in the family proteins which are in fact don’t belong to it. For improving performance of the model we need to readjust initial thresholds for training the model.

Conclusions

The main goal of this project — studying the application of the HMM to recognizing protein domains was fulfilled. The obtained model can be considered relatively reliable and demonstrates the fact that manually curated profile HMM can demonstrate a really high accuracy, precision and correlation coefficient as compared to the Pfam annotation.

During the process of performing the project we were able not only to exercise the walkthrough for the manual HMM profile creation, but also to develop skills of evaluating performance of machine learning algorithms.

References

Structural insights into serine protease inhibition by a marine invertebrate BPTI Kunitz-type inhibitor. Garcia-Fernandez R1, Pons T, Perbandt M.
Structure of the recombinant BPTI/Kunitz-type inhibitor rShPI-1A from the marine invertebrate Stichodactyla helianthus. Garcia-Fernandez, R., Pons, T., Meyer, A., Perbandt, M., Gonzalez-Gonzalez, Y., Gil, D., de los Angeles Chavez, M., Betzel, C., Redecke, L.
What is a hidden Markov model?. Sean R Eddy Nature Biotechnology 22, 1315 – 1316 (2004) doi:10.1038/nbt1004-1315
Stochastic models for heterogeneous DNA sequences. Churchill GA. Bull Math Biol. 1989;51(1):79-94.
The Pfam protein families database (Nucleic Acids Research 2012). Marco Punta, Penny C. Coggil/
E. Krissinel and K. Henrick (2004). Secondary-structure matching (PDBeFold), a new tool for fast protein structure alignment in three dimensions. Acta Cryst. D60, 2256—2268
HMMER User’s Guide Biological sequence analysis using profile hidden Markov models. Sean R. Eddy