Abstract
Dr. Anubha Dubey, Machine learning models for evaluation of domain based classification of AIDS HIV-1 groups, Onl J Bioinform 18(2):53-57, 2017. HIV-1 evolves through rapid accumulation of mutations and recombination, which contribute to its genetic diversity and produce many groups, types and subtypes. This is similar to protein domains, whose sequences and structures evolve, function and exist independently of the rest of the protein chain. Each domain forms a compact 3D structure that is independently stable and folded. One domain may appear in a variety of evolutionarily related proteins. Software and methods such as SVM, HMM and neural networks for prediction of domains generate different results and accuracy for the same input. A machine learning model for classifying HIV-1 group M, N and O domains is described. The HIV-1 domain based classification model was developed using sequences from the UniProt database as input for SBASE, SMART, NCBI Conserved Domain, Scan Prosite and Phylodome, with the J48, Bayes Net, Naive Bayes and Bagging algorithms. SBASE gave the highest accuracy (98.59%); NCBI Conserved Domain, Scan Prosite and Phylodome gave 95.07-97.18% and SMART gave 71.83% (Table 1).
INTRODUCTION
Human immunodeficiency virus (HIV), the cause of AIDS, has the structural genes gag, pol and
env and the regulatory and accessory genes vif, vpu, nef, tat, rev and vpr [1, 2]. HIV-1 strains
in America and Europe are genetically distinct from those in Africa and Asia [1]. HIV-1 and
HIV-2 are transmitted by sexual contact, through blood, and from mother to child, and cause
clinically indistinguishable AIDS [2]. HIV-2 is transmitted less readily, and the period between
initial infection and illness is longer [1]. HIV-1 is the predominant type and comprises group
M, the outlier group O and group N, with subtypes A-G [2]. The relatively uncommon HIV-2
type is concentrated in West Africa and rarely found elsewhere [3]. HIV mutates easily and
rapidly but requires a host. Heterogeneity of the virus complicates development of vaccines
and therapeutic agents [4].
Domains are the building blocks of proteins: structurally compact, independently folded
units that form a stable 3D structure, may exhibit evolutionary conservation, and typically
contain one or more motifs [18]. During evolution, domains have been duplicated, fused and
recombined to produce proteins with novel structures and functions. Domains vary in
length from about 25 to 500 amino acids and can occur in a variety of evolutionarily
related proteins [18]. “Promiscuous” protein domains are found in association with other
domains, and for sequence analysis one domain at a time should be studied [18]. Short
domains such as zinc fingers are stabilized by metal ions or disulphide bridges and often
form functional units, such as the calcium-binding EF-hand domain [5]. Software and
methods such as SVM, HMM and neural networks for prediction of domains generate
different results and accuracy for the same input. This leads to the dilemma of choosing
software for domain prediction when building a classifier that uses domains predicted
from an input sequence. Various research groups have attempted to develop classifiers
[6, 7, 8]. We describe a machine learning model for classifying protein domains in HIV-1
groups M, N and O.
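Naive Bayes classification is based on Bayes' theorem:
\[ P(c \mid x) = \frac{P(x \mid c)\, P(c)}{P(x)} \]
where
P(c | x) is the posterior probability of the class (c, target) given the predictor (x, attributes),
P(c) is the prior probability of the class,
P(x | c) is the likelihood, i.e. the probability of the predictor given the class, and
P(x) is the prior probability of the predictor.
For a set of attributes X = (x_1, ..., x_n) that are assumed to be conditionally independent
given the class, the posterior is proportional to the product of the individual likelihoods
and the prior:
\[ P(c \mid X) \propto P(x_1 \mid c) \times P(x_2 \mid c) \times \cdots \times P(x_n \mid c) \times P(c) \]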
A Bayes Net, instead of requiring all the attributes to be conditionally independent,
specifies which pairs of attributes are conditionally independent [14].
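As an illustration of the naive Bayes calculation above, the following Python sketch scores a sequence described by the presence or absence of predicted domains. The domain attributes, the toy counts and the add-one smoothing are hypothetical choices made for this example; they are not taken from the study's data.

```python
from collections import Counter, defaultdict

def train_naive_bayes(rows, labels):
    """Estimate the counts needed for P(c) and P(x_i | c) from categorical data."""
    class_counts = Counter(labels)
    # value_counts[(i, c)][v] = number of class-c rows whose attribute i equals v
    value_counts = defaultdict(Counter)
    values = defaultdict(set)  # distinct values seen for each attribute
    for row, c in zip(rows, labels):
        for i, v in enumerate(row):
            value_counts[(i, c)][v] += 1
            values[i].add(v)
    return class_counts, value_counts, values

def posterior(row, model):
    """Return P(c | x) for every class, normalised so the values sum to 1."""
    class_counts, value_counts, values = model
    total = sum(class_counts.values())
    scores = {}
    for c, n_c in class_counts.items():
        score = n_c / total  # prior P(c)
        for i, v in enumerate(row):
            # likelihood P(x_i | c) with add-one (Laplace) smoothing
            score *= (value_counts[(i, c)][v] + 1) / (n_c + len(values[i]))
        scores[c] = score
    evidence = sum(scores.values())  # plays the role of P(x)
    return {c: s / evidence for c, s in scores.items()}

# Hypothetical toy data: each row is (has_domain_A, has_domain_B).
rows = [(1, 0), (1, 0), (1, 1), (0, 1), (0, 1), (0, 0)]
labels = ["M", "M", "M", "O", "O", "N"]
model = train_naive_bayes(rows, labels)
probs = posterior((1, 0), model)
print(max(probs, key=probs.get), probs)
```

Dividing by the sum of the class scores replaces an explicit estimate of P(x), which is the same for every class and therefore does not change which class is ranked first.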
J48 builds a decision tree, a flowchart-like tree structure in which each internal node
(non-leaf node) denotes a test on an attribute, each branch represents an outcome of the
test, and each leaf node (or terminal node) holds a class label. The topmost node in a tree
is the root node. Internal nodes are denoted by rectangles, and leaf nodes are denoted by
ovals. The construction of decision tree classifiers does not require any domain knowledge
or parameter setting, and is therefore appropriate for exploratory knowledge discovery [14].
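J48 is the Weka implementation of the C4.5 decision tree learner, which chooses the attribute tested at each node by how well it separates the classes. The Python sketch below computes the entropy and information gain of a single categorical attribute. It is a simplified illustration (C4.5 itself uses the gain ratio, and the toy attributes are hypothetical), not the procedure used in the study.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((k / n) * math.log2(k / n) for k in Counter(labels).values())

def information_gain(values, labels):
    """Entropy reduction obtained by splitting on one categorical attribute."""
    n = len(labels)
    remainder = 0.0
    for v in set(values):
        subset = [c for x, c in zip(values, labels) if x == v]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder

# Hypothetical toy data: presence (1) or absence (0) of two predicted domains.
labels = ["M", "M", "N", "N", "O", "O"]
domain_a = [1, 1, 0, 0, 0, 0]  # separates M from the other groups
domain_b = [1, 0, 1, 0, 1, 0]  # unrelated to the group

gains = {
    "domain_a": information_gain(domain_a, labels),
    "domain_b": information_gain(domain_b, labels),
}
root = max(gains, key=gains.get)  # attribute that would be tested at the root node
print(root, gains)
```

The attribute with the larger gain becomes the test at the node; the same calculation is then repeated on each branch until the leaves hold a single class label.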
Bagging, also known as bootstrap aggregating, repeatedly samples (with replacement) from a
data set according to a uniform probability distribution. Each bootstrap sample has the
same size as the original data [14].
The proteins used for this study were collected from the UniProt database [14]. Incomplete
sequences (fragments) were removed. The NRDB program was used to verify that no two
sequences in the data set were identical.
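The bootstrap sampling and voting steps of bagging described above can be sketched in Python as follows. The base learner here is a deliberately trivial majority-class predictor and the (features, label) pairs are hypothetical; the sketch only illustrates the resampling scheme, not the classifiers or features used in the study.

```python
import random
from collections import Counter

def bootstrap_sample(data, rng):
    """Draw len(data) items uniformly at random, with replacement."""
    return [rng.choice(data) for _ in data]

def train_majority(sample):
    """Trivial base learner: always predict the sample's most common label."""
    majority = Counter(label for _, label in sample).most_common(1)[0][0]
    return lambda features: majority

def bagging_predict(models, features):
    """Aggregate the base learners by majority vote."""
    votes = Counter(model(features) for model in models)
    return votes.most_common(1)[0][0]

rng = random.Random(0)  # fixed seed so the run is repeatable
# Hypothetical (features, label) pairs.
data = [((1, 0), "M")] * 6 + [((0, 1), "N")] * 2 + [((0, 0), "O")] * 2

samples = [bootstrap_sample(data, rng) for _ in range(25)]
models = [train_majority(s) for s in samples]
print(bagging_predict(models, (1, 0)))
```

In practice the trivial learner would be replaced by a real classifier such as a decision tree, trained once per bootstrap sample.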
RESULTS AND DISCUSSION
The instances taken from the UniProt database were given as input to SBASE, SMART, NCBI
Conserved Domain, Scan Prosite and Phylodome, and the accuracies obtained with the J48,
Bayes Net, Naive Bayes and Bagging algorithms are given in Table 1. SBASE gave the highest
accuracy, 98.59%.
Table 1: Accuracy (%) obtained with each domain prediction program and machine learning algorithm.
Classifier     SBASE    SMART    NCBI     Scan Prosite    Phylodome
J48            98.59    71.83    95.07    96.47           97.18
Bayes Net      98.59    71.83    95.07    96.47           97.18
Naive Bayes    98.59    71.83    95.07    96.47           97.18
Bagging        98.59    71.83    95.07    96.47           97.18
The effect of combining the programs on the accuracy of the model is shown in Table 2.
Table 2: Accuracy (%) obtained with combinations of domain prediction programs and machine learning algorithms.
Classifier     SB+PHY   SB+SC    SB+SM    SM+NC    SC+SM    SB+NC    SB+SC    PHY+SM   PHY+NC   SC+NC
J48            90.85    90.85    92.07    37.19    39.02    90.85    90.85    31.7     33.53    38.41
Naive Bayes    90.24    87.80    90.24    37.19    37.8     90.24    87.80    32.92    33.53    38.41
Bayes Net      80.48    78.04    79.26    37.19    37.8     79.26    78.04    32.31    33.53    38.41
Bagging        90.85    90.85    90.85    36.58    39.02    90.85    90.85    30.48    33.53    38.41
Where SB = SBASE, PHY = Phylodome, SC = Scan Prosite, SM = SMART and NC = NCBI Conserved Domain.
A confusion matrix is a visualization tool used in supervised learning (in unsupervised
learning it is called a matching matrix). Each column of the matrix represents the instances
in a predicted class, while each row represents the instances in an actual class. One benefit
of a confusion matrix is that it is easy to see whether the system is confusing two classes,
i.e. commonly mislabeling one as another. The confusion matrix for the classification of
HIV-1 into the M, N and O groups is given in Table 3.
Table 3: Confusion matrix generated by Naive Bayes
a     b     c     d     Classified as
27    0     1     0     a = M group
0     34    0     0     b = N group
0     0     24    0     c = O group
0     0     0     26    d = MNO group
Table 4: Accuracy by class
TP Rate    FP Rate    Precision    Recall    F-measure    Group
0.964      0          1            0.964     0.982        M
1          0          1            1         1            N
1          0.017      0.923        1         0.96         O
1          0          1            0.967     1            MNO
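These can be calculated as:
\[ \text{precision} = \frac{tp}{tp + fp} \]
\[ \text{recall} = \frac{tp}{tp + fn} \]
\[ \text{true negative rate} = \frac{tn}{tn + fp} \]
\[ \text{accuracy} = \frac{tp + tn}{tp + tn + fp + fn} \]
\[ \text{F-measure} = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} \]
where tp, fp, tn and fn are the numbers of true positives, false positives, true negatives
and false negatives.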
In a classification task, the precision for a class is the number of true positives (i.e. the
number of items correctly labelled as belonging to the positive class) divided by the total
number of elements labelled as belonging to the positive class (i.e. the sum of true
positives and false positives, the latter being items incorrectly labelled as belonging to the
class). Recall is the number of true positives divided by the total number of elements that
actually belong to the positive class (i.e. the sum of true positives and false negatives, the
latter being items that were not labelled as belonging to the positive class but should have
been) [19].
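The definitions above can be applied to a multi-class confusion matrix one class at a time, treating that class as positive and all others as negative. The Python sketch below does this for the counts printed in Table 3 (rows are actual classes, columns are predicted classes). The function and variable names are illustrative, and values computed from the matrix as printed need not coincide with every entry of Table 4.

```python
def per_class_metrics(matrix, classes):
    """One-vs-rest precision, recall and F-measure for each class.

    matrix[i][j] is the number of instances of actual class i
    that were predicted as class j.
    """
    metrics = {}
    for k, name in enumerate(classes):
        tp = matrix[k][k]
        fn = sum(matrix[k]) - tp                 # actual k, predicted as another class
        fp = sum(row[k] for row in matrix) - tp  # predicted k, actually another class
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f_measure = 2 * precision * recall / denom if denom else 0.0
        metrics[name] = {"precision": precision, "recall": recall, "f_measure": f_measure}
    return metrics

def accuracy(matrix):
    """Fraction of instances on the diagonal (correctly classified)."""
    correct = sum(matrix[k][k] for k in range(len(matrix)))
    return correct / sum(sum(row) for row in matrix)

# Counts as printed in Table 3.
classes = ["M", "N", "O", "MNO"]
table3 = [
    [27, 0, 1, 0],
    [0, 34, 0, 0],
    [0, 0, 24, 0],
    [0, 0, 0, 26],
]

results = per_class_metrics(table3, classes)
for name, m in results.items():
    print(name, {key: round(value, 3) for key, value in m.items()})
print("accuracy", round(accuracy(table3), 4))
```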
Precision and recall are combined in the F-measure, which is their weighted harmonic
mean, or in the Matthews correlation coefficient, which is a geometric mean of the
chance-corrected variants. Accuracy is a weighted arithmetic mean of precision and inverse
precision (weighted by bias) as well as a weighted arithmetic mean of recall and inverse
recall (weighted by prevalence) [20].
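The relationships stated above can be checked numerically for a binary (one-vs-rest) table. The Python sketch below uses hypothetical counts and assumes the usual definitions of the chance-corrected variants: informedness (recall + inverse recall - 1) and markedness (precision + inverse precision - 1). It is an illustration, not part of the original analysis.

```python
import math

def binary_summary(tp, fp, fn, tn):
    """Return accuracy, MCC and the quantities they can be built from."""
    n = tp + fp + fn + tn
    precision = tp / (tp + fp)      # positive predictive value
    inv_precision = tn / (tn + fn)  # negative predictive value
    recall = tp / (tp + fn)         # true positive rate
    inv_recall = tn / (tn + fp)     # true negative rate
    bias = (tp + fp) / n            # fraction predicted positive
    prevalence = (tp + fn) / n      # fraction actually positive
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {
        "accuracy": (tp + tn) / n,
        "acc_from_precision": bias * precision + (1 - bias) * inv_precision,
        "acc_from_recall": prevalence * recall + (1 - prevalence) * inv_recall,
        "mcc": mcc,
        "informedness": recall + inv_recall - 1,
        "markedness": precision + inv_precision - 1,
    }

# Hypothetical counts for one class treated as "positive".
s = binary_summary(tp=40, fp=5, fn=10, tn=45)
# Informedness and markedness share the sign of MCC, so their product is non-negative;
# the geometric mean gives the magnitude of MCC.
geometric_mean = math.sqrt(s["informedness"] * s["markedness"])
print(round(s["accuracy"], 4), round(s["mcc"], 4), round(geometric_mean, 4))
```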
For the HIV-1 M, N and O dataset, the results obtained with SBASE, SMART, Scan Prosite,
NCBI Conserved Domain and Phylodome and the machine learning algorithms J48, Naive
Bayes, Bayes Net and Bagging (Tables 1, 2 and 4) show that SBASE predicts domains with
98.59% accuracy. This study shows the importance of protein domains in HIV subtypes,
which will aid:
- Analysis of the protein structure of HIV subtypes.
- Comparison of the protein sequences of HIV-1 subtypes, which is often confined to
  regions of the sequence that correspond to structural domains.
- Prediction of protein function, which is based on protein domains.
- Structural classification, which is constructed using domains as building blocks.
Multiple aspects contribute to the concept of structural domains:
- Evolutionary aspect of HIV and its types.
- Structural aspect of HIV and its types (compactness and independent folding of domains).
- Functional aspect of HIV and its types (ability to carry out a function).
- Comparative analysis of HIV and related viruses in evolutionary, structural and
  functional aspects.
CONCLUSION
The domain based classification of HIV-1 groups M, N and O leads to a better understanding of HIV infection and its types. This work will help in the development of novel approaches to wet lab techniques for devising novel drugs and therapeutic agents for HIV types and subtypes. The correlation of protein domains with structure explored here can be useful for obtaining better insights into these proteins. SBASE gave the highest accuracy in predicting protein domains in the given dataset. As more sequences are added to the databases, the model developed here can be improved further.
Acknowledgements:
The author thanks the Department of Biotechnology, New Delhi, for providing the
Bioinformatics Infrastructure Facility at MANIT, Bhopal, for performing this study.
Dr.ANUBHA DUBEY EDUCATION DIRECTOR & TRAINER - KANISHKSOCIALMEDIA
(PHD,BIOINFORMATICS & MBA HR,)