A Computational Approach To Classify HIV Secondary Structure Of Enzymes
ABSTRACT
The structure of a protein can reveal its function and its evolutionary history. Extracting this information requires knowledge of the structure and of its relationship to other proteins. Protein secondary structure is compact and is built from helices and strands. Hence there is a need for computational techniques for the prediction and classification of HIV-1 and HIV-2 protein (enzyme) structures. In this paper a machine learning model has been developed to classify alpha, beta and residues of four HIV enzymes: ribonuclease, reverse transcriptase, protease and integrase. All four are present in the HIV-1 and HIV-2 replication cycles. The machine learning algorithms J48, Rotation Forest and Random Forest have been used for this classification, and the resulting models give fair accuracy. The information generated from these models can be of great use in clinical applications.
INTRODUCTION
Human immunodeficiency virus (HIV) is a lentivirus (a
member of the retrovirus family) that causes acquired
immunodeficiency syndrome (AIDS) [1,2]. HIV is of two
types, HIV-1 and HIV-2. HIV-1 is the virus that was initially
discovered and termed both LAV and HTLV-III. It is more
virulent and more infective [3], and is the cause of the
majority of HIV infections globally. The lower infectivity of
HIV-2 compared to HIV-1 implies that fewer of those exposed
to HIV-2 will be infected per exposure. Because of its
relatively poor capacity for transmission, HIV-2 is largely
confined to West Africa [4]. HIV is different in structure
from other retroviruses. It is roughly spherical [5] with a
diameter of about 120 nm, around 60 times smaller than a
red blood cell, yet large for a virus [6]. It is composed of two
copies of positive single-stranded RNA that code for the
virus's nine genes, enclosed by a conical capsid composed of
2,000 copies of the viral protein p24 [7]. The single-stranded
RNA is tightly bound to nucleocapsid proteins (p7) and to
enzymes needed for the development of the virion, such as
reverse transcriptase, protease, ribonuclease and integrase. A
matrix composed of the viral protein p17 surrounds the
capsid, ensuring the integrity of the virion particle [8]. HIV
enters macrophages and CD4+ T-cells by the adsorption of
glycoproteins on its surface to receptors on the target cell,
followed by fusion of the viral envelope with the cell
membrane and the release of the HIV capsid into the cell
[9,10]. After HIV has bound to the target cell, the HIV RNA
and various enzymes, including reverse transcriptase,
integrase, ribonuclease and protease, are injected into the
cell. During the microtubule-based transport to the nucleus,
the viral single-stranded RNA genome is transcribed into
double-stranded DNA, which is then integrated into a host
chromosome [11]. After the viral capsid enters the cell, an
enzyme called reverse transcriptase liberates the single-
stranded (+) RNA genome from the attached viral proteins
and copies it into a complementary DNA (cDNA) molecule
[12]. The process of reverse transcription is extremely error-
prone, and the resulting mutations may cause drug resistance
or allow the virus to evade the body's immune system. The
reverse transcriptase also has ribonuclease activity that
degrades the viral RNA during the synthesis of cDNA, as
well as DNA-dependent DNA polymerase activity that
creates a sense DNA from the antisense cDNA [13].
Together, the cDNA and its complement form a double-
stranded viral DNA that is then transported into the cell
nucleus. The integration of the viral DNA into the host cell's
genome is carried out by another viral enzyme called
integrase [14]. The final step of the viral cycle, assembly of
new HIV-1 virions, begins at the plasma membrane of the
host cell. During maturation, HIV proteases cleave the
polyproteins into individual functional HIV proteins and
enzymes. The various structural components then assemble
to produce a mature HIV virion [15]. This cleavage step can
be inhibited by protease inhibitors. The mature virus is then
able to infect another cell. These enzymes are proteins, and
hence their secondary structure plays an important role.
Protein secondary structure is compact and is built from
helices and strands. Hence there is a need for computational
techniques for the prediction and classification of HIV-1
and HIV-2 protein (enzyme) structures. In this paper a
machine learning model has been developed to classify alpha,
beta and residues of four HIV enzymes: ribonuclease, reverse
transcriptase, protease and integrase. All four are present
in the HIV-1 and HIV-2 replication cycles [19,20,21,22], as
shown in Figure 1. The machine learning algorithms J48,
Rotation Forest and Random Forest have been used for this
classification, and the resulting models give fair accuracy.
The information generated from these models can be of great
use in clinical applications and for understanding HIV
structure better, as these enzymes are key drug targets.
Figure 1: Replication cycle of HIV showing the role of enzymes.
Method: The protein secondary structure data were taken
from the Protein Data Bank (PDB) [15]; the present work
focuses on classifying these data further according to
alpha, beta and residue. Various machine learning algorithms
are available for the classification and prediction of
alpha, beta and residues. The models here were developed
using different classification algorithms of WEKA [16]. For
the same input these algorithms give different results and
differ in accuracy, which leads to the dilemma of choosing
an algorithm for classification and prediction. The
classification uses merely the predicted domain from the
input sequence. Of the various algorithms, J48, Random
Forest and Rotation Forest give the better results, with
fair accuracies.
J48: J48 is the WEKA implementation of the C4.5 decision
tree learner. A decision tree (or tree diagram) is a decision support
tool that uses a tree-like graph or model of decisions and
their possible consequences, including chance event
outcomes, resource costs, and utility. Decision trees are
commonly used in operations research, specifically in
decision analysis, to help identify a strategy most likely to
reach a goal. Another use of decision trees is as a descriptive
means for calculating conditional probabilities. In data
mining and machine learning, a decision tree is a predictive
model; that is, a mapping from observations about an item to
conclusions about its target value. More descriptive names
for such tree models are classification tree (discrete
outcome) or regression tree (continuous outcome). In these
tree structures, leaves represent classifications and branches
represent conjunctions of features that lead to those
classifications. The machine learning technique for inducing
a decision tree from data is called decision tree learning, or
(colloquially) decision trees [17].
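
As an illustration of how such a tree chooses its branches, the sketch below computes the gain ratio, the split criterion used by C4.5-style learners such as J48. It is a minimal sketch under stated assumptions: the attribute values and class labels are hypothetical toy data, not the attributes of the PDB dataset used in this work, and the function names `entropy` and `gain_ratio` are introduced here only for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of nominal values."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def gain_ratio(values, labels):
    """Gain ratio obtained by splitting `labels` on a nominal attribute."""
    total = len(labels)
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    # Expected entropy after the split, weighted by branch size.
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    info_gain = entropy(labels) - remainder
    # Split information penalises attributes with many small branches.
    split_info = entropy(values)
    return info_gain / split_info if split_info > 0 else 0.0


# Hypothetical toy data: one nominal attribute value and one class per instance.
enzyme = ["protease", "protease", "integrase",
          "integrase", "protease", "integrase"]
noise = ["a", "b", "a", "b", "a", "b"]
label = ["alpha", "alpha", "beta", "beta", "alpha", "beta"]

print("gain ratio (enzyme):", gain_ratio(enzyme, label))
print("gain ratio (noise): ", gain_ratio(noise, label))
```

In this toy case the first attribute separates the classes perfectly and therefore scores far higher than the uninformative one, so a tree learner would branch on it first.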
Random Forest is a class of ensemble methods specially
designed for decision tree classifiers. It combines the
predictions made by multiple decision trees, where each tree
is generated based on the values of an independent set of
random vectors. The random vectors are generated from a
fixed probability distribution. Bagging using decision trees
is a special case of random forests, where randomness is
injected into the model building process by randomly
choosing N samples, with replacement, from the original
training set. It has been theoretically proved that the upper
bound for the generalization error of random forests
converges to the following expression when the number of
trees is sufficiently large:
\[ \text{Generalization error} \le \frac{\rho\,(1 - s^2)}{s^2} \]
where ρ is the average correlation among the trees and s is a
quantity that measures the strength of the tree classifiers.
The strength of a set of classifiers refers to the average
performance of the classifiers, where performance is measured
probabilistically in terms of the classifier margin:
\[ M(X, Y) = P(Y_\theta = Y) - \max_{Z \neq Y} P(Y_\theta = Z) \]
where Yθ is the predicted class of X according to a classifier
built from some random vector θ. The higher the margin is,
the more likely it is that the classifier correctly predicts a
given example X [17].
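
The quantities in this paragraph can be made concrete with a short sketch. It assumes the usual textbook form of the bound, ρ(1 − s²)/s², and estimates the margin from the votes of an ensemble as the fraction of trees voting for the true class minus the largest fraction voting for any other class. The vote list and the values of ρ and s are hypothetical and are not taken from the experiments reported here.

```python
import random
from collections import Counter


def bootstrap_sample(instances, rng):
    """Draw len(instances) items with replacement (the bagging step)."""
    return [rng.choice(instances) for _ in instances]


def margin(votes, true_label):
    """Fraction of trees voting for the true class minus the largest
    fraction voting for any single other class."""
    counts = Counter(votes)
    n = len(votes)
    others = [c for lab, c in counts.items() if lab != true_label]
    return counts[true_label] / n - (max(others) / n if others else 0.0)


def error_bound(rho, s):
    """Upper bound rho * (1 - s^2) / s^2 on the generalization error."""
    return rho * (1 - s ** 2) / s ** 2


rng = random.Random(0)                      # fixed seed for repeatability
sample = bootstrap_sample(list(range(10)), rng)

# Hypothetical votes of five trees for one instance whose true class is alpha.
votes = ["alpha", "alpha", "beta", "alpha", "residue"]

print("bootstrap sample:", sample)
print("margin:", margin(votes, "alpha"))
print("bound for rho=0.2, s=0.6:", error_bound(0.2, 0.6))
```

A positive margin means the ensemble votes for the true class, and the bound tightens as the trees become less correlated (smaller ρ) or individually stronger (larger s).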
Rotation Forest: It is built from a set of decision trees. For
each tree, bootstrap samples extracted from the original
training set are used to construct a new training set. Then
the feature set of the new training set is randomly split into
several subsets, each of which is transformed individually
with a linear transformation method. Consequently, a full
feature set is reconstructed from all the transformed features
for each tree in the ensemble. Since a small rotation of the
axes may produce a completely different tree, the
transformation helps ensure the diversity of the ensemble [18].
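
The feature rotation described above can be sketched for a single tree as follows. This is a simplified illustration, not the WEKA implementation: it assumes principal component analysis as the linear transformation, restricts every feature subset to two features so that the principal axis has a closed form, and uses hypothetical numeric data. The helper names `pca_angle` and `rotate_features` are introduced only for this sketch, and the decision tree that would then be trained on the rotated data is omitted.

```python
import math
import random


def pca_angle(xs, ys):
    """Angle of the first principal axis of 2-D points (closed form)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return 0.5 * math.atan2(2 * sxy, sxx - syy)


def rotate_features(rows, rng):
    """Rotate randomly chosen feature pairs onto their principal axes.

    The axes are estimated from a bootstrap sample and then applied to
    every row. Assumes an even number of features.
    """
    order = list(range(len(rows[0])))
    rng.shuffle(order)                        # random split of the feature set
    boot = [rng.choice(rows) for _ in rows]   # bootstrap sample for this tree
    rotated = [list(r) for r in rows]
    for i, j in zip(order[0::2], order[1::2]):
        theta = pca_angle([r[i] for r in boot], [r[j] for r in boot])
        c, s = math.cos(theta), math.sin(theta)
        for src, dst in zip(rows, rotated):
            dst[i] = c * src[i] + s * src[j]
            dst[j] = -s * src[i] + c * src[j]
    return rotated


# Hypothetical data: four instances with four numeric features each.
rows = [[1.0, 2.0, 0.5, 3.0],
        [2.0, 4.1, 1.5, 2.0],
        [3.0, 5.9, 2.5, 1.0],
        [4.0, 8.2, 3.5, 0.5]]
for row in rotate_features(rows, random.Random(1)):
    print([round(v, 3) for v in row])
```

Because each tree draws its own feature split and bootstrap sample, each tree sees a differently rotated copy of the same data, which is the source of diversity in the ensemble.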
RESULTS & DISCUSSION
To achieve our goal and develop our methodology we
obtained the dataset from the Protein Data Bank (PDB) for
both HIV-1 and HIV-2. The following six cases arise for the
classification of HIV-1 and HIV-2 enzymes. Classification
of the PDB data according to HIV reverse transcriptase, HIV
protease and HIV ribonuclease by J48, Random Forest and
Rotation Forest gives the following results.
The confusion matrix of alpha+beta for HIV-1 and HIV-2
generated from the above is given below:
The detailed accuracy by class is shown below:
ROC: The Receiver Operating Characteristic (ROC) curve is a
graphical technique for evaluating data mining schemes. ROC
curves depict the performance of a classifier without regard
to class distribution or error costs. They plot the number of
positives included in the sample on the vertical axis,
expressed as a percentage of the total number of positives,
against the number of negatives included in the sample,
expressed as a percentage of the total number of negatives,
on the horizontal axis. For each fold of a 10-fold cross
validation, weight the instances for a selection of different
cost ratios, train the scheme on each weighted set, count the
true positives and false positives in the test set, and plot
the resulting point on the ROC axes. The ROC curves for the
different classes have been plotted as shown in Figures
(1, 2). From the confusion matrix we can infer that for the
HIV-1 class the false positive rate is 0.643 and the true
positive rate is 0.981. The accuracy of the results for these
two classes obtained from all three classifiers is presented
as follows.
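
The ROC quantities discussed above can be computed as in the sketch below. It is only an illustration: the confusion counts and the scores are hypothetical, since the confusion matrix of this study is not reproduced here, and the points are traced by lowering a decision threshold rather than by the cost-weighting procedure described in the text.

```python
def rates(tp, fn, fp, tn):
    """Return (true positive rate, false positive rate) from confusion counts."""
    return tp / (tp + fn), fp / (fp + tn)


def roc_points(scores, labels):
    """ROC points obtained by lowering the decision threshold one instance
    at a time. `labels` holds True for positives; ties are not merged."""
    pos = sum(labels)
    neg = len(labels) - pos
    ranked = sorted(zip(scores, labels), key=lambda p: -p[0])
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, is_pos in ranked:
        if is_pos:
            tp += 1
        else:
            fp += 1
        points.append((fp / neg, tp / pos))   # (FPR, TPR)
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Hypothetical classifier scores for six test instances (True = positive).
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
labels = [True, True, False, True, False, False]
pts = roc_points(scores, labels)

print("TPR, FPR for hypothetical counts:", rates(tp=9, fn=1, fp=2, tn=8))
print("ROC points:", pts)
print("AUC:", auc(pts))
```

A curve that rises quickly towards the top-left corner, and hence has an area close to 1, indicates a classifier that ranks positives above negatives.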
Thus we observe that out of the 346 instances of HIV
reverse transcriptase, protease, integrase and ribonuclease
taken for cross validation, 322 were classified correctly
and 24 were classified incorrectly by the Rotation Forest
classifier. This amounts to 93.0636% accuracy, which was
the highest among the three classifiers used here.
Thus the above classifier is able to classify HIV-1 and
HIV-2 enzymes, for which no algorithm has been reported in
the literature so far. We can increase the number of
instances by adding secondary structure data of other
organisms such as mouse, rat, pig and others, but this does
not give any significant change. This implies that the human
instances alone are sufficient to develop the classifier.
The reason is that the similarity of enzyme structures
between human and other organisms is 75-85%. Hence inclusion
of secondary structure data of other organisms would
increase not only the number of instances but also the
redundancy. The same model can be applied to organisms such
as mouse, rat and others, for which secondary structure
information is available in the Protein Data Bank, the
structure database of proteins.
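
The reported accuracy follows directly from the counts given above, and the sketch below reproduces the arithmetic together with a plain 10-fold partition of 346 instances. The fold construction is a simplifying assumption for illustration: it is neither shuffled nor stratified by class, so it does not reproduce the exact folds of the experiments.

```python
def accuracy_percent(correct, total):
    """Percentage of correctly classified instances."""
    return 100.0 * correct / total


def k_fold_indices(n, k=10):
    """Split indices 0..n-1 into k contiguous folds whose sizes differ
    by at most one (no shuffling or stratification)."""
    folds, start = [], 0
    for f in range(k):
        size = n // k + (1 if f < n % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


folds = k_fold_indices(346, 10)
print("fold sizes:", [len(f) for f in folds])
print("accuracy: %.4f %%" % accuracy_percent(322, 346))
```

In each round one fold serves as the test set and the remaining nine as the training set, so every one of the 346 instances is tested exactly once.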
CONCLUSION
The above classifier takes into account the secondary
structure of all 346 known HIV enzymes. As the Rotation
Forest classifier performs best among the three
classifiers, it qualifies as the most suitable choice for
classification and prediction. The authors wish to extend
it as soon as more information becomes available. The
above model is useful for generating information which can
be of great use in the prediction of the structure and
function of all these enzymes, since they are key drug
targets. A protein structure belonging to a particular class
will have the functional domains, alpha helices and beta
sheets corresponding to that class, which will ease locating
the active site(s) as well as the binding site(s) in the
classified protein; these can be potential active sites or
binding sites for a drug. As more structures of HIV
enzymes are discovered, the above classifier can be
retrained to improve the accuracy of the results.
ACKNOWLEDGEMENT
The authors are highly thankful to the Department of
Biotechnology, New Delhi, for providing the Bioinformatics
Infrastructure Facility at MANIT, Bhopal, for carrying out
this work.
DR ANUBHA DUBEY INDEPENDENT RESEARCHER,
PHONE NO,9993210963,9827649560
ID ; ashutoshdubey3489@gmail.com
https://www.facebook.com/pg/Kanishk-103603547852455/posts/