Machine learning models for evaluation of domain-based classification of HIV-1 groups

Abstract

Dr. Anubha Dubey, Machine learning models for evaluation of domain based classification of AIDS HIV-1 groups, Onl J Bioinform 18(2):53-57, 2017.

HIV-1 evolves through rapid accumulation of mutations and recombination, which actively contribute to its genetic diversity and produce many groups, types and subtypes. This is similar to protein domains, whose sequences and structures evolve, function and exist independently of the rest of the protein chain. Each domain forms a compact 3D structure which is independently stable and folded. One domain may appear in a variety of evolutionarily related proteins. Software and methods such as SVM, HMM and neural networks for prediction of domains generate different results and accuracy for the same input. A machine learning model for classifying HIV-1 M, N and O group domains is described. The HIV-1 domain-based classification model was developed using the UniProt database as input for SBASE, SMART, NCBI Conserved Domain, Scan Prosite and Phylodome with the J48, Bayes Net, Naive Bayes and Bagging algorithms. Results showed that SBASE predicted 98.59% and other programs 95.07-97.18% domains.

INTRODUCTION


Human immunodeficiency virus (HIV), the cause of AIDS, has the structural genes gag, pol and
env and the regulatory and accessory genes vif, vpu, nef, tat, rev and vpr [1, 2]. HIV-1
strains in America and Europe are genetically distinct from those in Africa and Asia [1].
HIV-1 and HIV-2 are transmitted by sexual contact, through blood, and from mother to child,
and cause clinically indistinguishable AIDS [2]. HIV-2 is transmitted less readily, and the
period between initial infection and illness is longer [1]. HIV-1 is the predominant type
and comprises group M, the outlier group O and group N, with subtypes A-G [2]. The
relatively uncommon HIV-2 type is concentrated in West Africa and rarely found elsewhere
[3]. HIV has the capacity to mutate easily and rapidly but requires a host. Heterogeneity of
the virus complicates the development of vaccines and therapeutic agents [4].

Domains are the building blocks of proteins. They are structurally compact, independently
folded units that form a stable 3D structure, may exhibit evolutionary conservation, and
typically have one or more motifs [18]. During evolution, domains have been duplicated, fused
and recombined to produce proteins with novel structures and functions. Domains vary in
length from about 25 to 500 amino acids and can occur in a variety of evolutionarily related
proteins [18]. "Promiscuous" protein domains are found in association with other domains,
and for sequence analysis one domain at a time should be studied [18]. Short domains such as
zinc fingers are stabilized by metal ions or disulphide bridges and often form functional
units, such as the calcium-binding EF-hand domain [5]. Software and methods such as SVM, HMM
and neural networks for prediction of domains generate different results and accuracy for
the same input. This leads to the dilemma of choosing which software to use for domain
prediction when building a classifier that takes the domains predicted from an input
sequence. Attempts have been made by various research groups to develop such classifiers
[6, 7, 8]. We describe a machine learning model for classifying protein domains in HIV-1
groups M, N and O.
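$$P(c \mid X) \propto P(x_1 \mid c) \times P(x_2 \mid c) \times \cdots \times P(x_n \mid c) \times P(c)$$

where
 $P(c \mid x)$ is the posterior probability of the class (c, target) given the predictor (x, attributes),
 $P(c)$ is the prior probability of the class,
 $P(x \mid c)$ is the likelihood, which is the probability of the predictor given the class,
 $P(x)$ is the prior probability of the predictor,
and Bayes' theorem gives

$$P(c \mid x) = \frac{P(x \mid c)\, P(c)}{P(x)}$$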
Bayes Net, instead of requiring all the attributes to be conditionally independent as Naive
Bayes does, specifies which pairs of attributes are conditionally independent [14].
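
The Bayes rule and the naive product form given above can be illustrated with a minimal Python sketch. The class names, the two binary "domain present" features and every probability value below are hypothetical and chosen only for illustration; they are not taken from the study's data, and this is not the implementation used in the study.

```python
from math import prod

def naive_bayes_posterior(priors, likelihoods, observed):
    """Return P(c | x) for every class c under the naive independence assumption.

    priors:      {class: P(c)}
    likelihoods: {class: {feature: P(feature present | c)}}
    observed:    {feature: True/False}
    """
    scores = {}
    for c, prior in priors.items():
        # P(x1|c) * P(x2|c) * ... * P(xn|c) * P(c)
        per_feature = [
            likelihoods[c][f] if present else 1.0 - likelihoods[c][f]
            for f, present in observed.items()
        ]
        scores[c] = prod(per_feature) * prior
    evidence = sum(scores.values())  # P(x), the normalising constant
    return {c: s / evidence for c, s in scores.items()}

# Hypothetical numbers, for illustration only.
priors = {"M": 0.5, "O": 0.5}
likelihoods = {
    "M": {"domain_a": 0.9, "domain_b": 0.2},
    "O": {"domain_a": 0.3, "domain_b": 0.6},
}
observed = {"domain_a": True, "domain_b": False}
posterior = naive_bayes_posterior(priors, likelihoods, observed)
print(posterior)
```

The predicted class is the one with the largest posterior; dividing by the evidence P(x) does not change which class wins, it only rescales the scores so that they sum to one.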

J48 builds a decision tree, a flowchart-like tree structure in which each internal node
(non-leaf node) denotes a test on an attribute, each branch represents an outcome of the
test, and each leaf node (or terminal node) holds a class label. The topmost node in a tree
is the root node. Internal nodes are denoted by rectangles, and leaf nodes are denoted by
ovals. The construction of decision tree classifiers does not require any domain knowledge
or parameter setting, and is therefore appropriate for exploratory knowledge discovery [14].
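
J48 is Weka's implementation of the C4.5 algorithm, which picks the attribute to test at each node by gain ratio. The sketch below shows that criterion on a toy data set; it is an illustration under stated assumptions, not Weka's code, and the labels and the two binary attributes `domain_a` and `domain_b` are hypothetical.

```python
from collections import Counter
from math import log2

def entropy(items):
    """Shannon entropy (in bits) of a list of discrete values."""
    total = len(items)
    return -sum((n / total) * log2(n / total) for n in Counter(items).values())

def gain_ratio(values, labels):
    """Information gain of splitting on `values`, divided by the split information."""
    total = len(labels)
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    # Expected entropy of the class labels after the split
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    gain = entropy(labels) - remainder
    split_info = entropy(values)  # entropy of the attribute's own value distribution
    return gain / split_info if split_info > 0 else 0.0

# Hypothetical toy data, for illustration only.
labels   = ["M", "M", "N", "N", "O", "O"]
domain_a = [1, 1, 0, 0, 0, 0]   # separates group M from the rest
domain_b = [1, 0, 1, 0, 1, 0]   # carries no information about the group
ratio_a = gain_ratio(domain_a, labels)
ratio_b = gain_ratio(domain_b, labels)
print(ratio_a, ratio_b)
```

An attribute with a higher gain ratio is the better candidate for the test at a node; here `domain_a` would be preferred over `domain_b`.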

Bagging, also known as bootstrap aggregating, repeatedly samples from a data set according
to a uniform probability distribution. Each bootstrap sample has the same size as the
original data [14].

The proteins used for this study were collected from the UniProt database [14]. Incomplete
sequences containing fragments were removed. The NRDB program was used to verify that no two
sequences in the data set were identical.
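
The bootstrap sampling and voting steps of bagging described above can be sketched as follows. The base learner here is a deliberately trivial stand-in (it always predicts the most common label in its sample), and the sequence identifiers and labels are hypothetical; a real setup would train a proper classifier on each bootstrap sample.

```python
import random
from collections import Counter

def bootstrap_sample(data, rng):
    """Draw len(data) items uniformly at random, with replacement."""
    return [rng.choice(data) for _ in data]

def majority_vote(predictions):
    """Return the most common label among the base classifiers' predictions."""
    return Counter(predictions).most_common(1)[0][0]

def train_majority_class(sample):
    """Hypothetical stand-in base learner: always predicts the sample's most common label."""
    label = Counter(y for _, y in sample).most_common(1)[0][0]
    return lambda x: label

rng = random.Random(0)  # fixed seed so the run is repeatable
data = [("seq1", "M"), ("seq2", "M"), ("seq3", "M"), ("seq4", "N"), ("seq5", "O")]

# Train one base classifier per bootstrap sample, then combine them by voting.
samples = [bootstrap_sample(data, rng) for _ in range(25)]
models = [train_majority_class(s) for s in samples]
print(majority_vote([m("new_sequence") for m in models]))
```

Because each sample is drawn with replacement, some records appear several times in a sample and others not at all, which is what makes the base classifiers differ from one another.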

 RESULTS AND DISCUSSION

The instances taken from the UniProt database were given as input to SBASE, SMART, NCBI
Conserved Domain, Scan Prosite and Phylodome, and the accuracies obtained with the J48,
Bayes Net, Naive Bayes and Bagging algorithms are given in Table 1 below. SBASE predicted
domains with 98.59% accuracy.
Table 1: Comparative analysis of the different software used with machine learning
algorithms (accuracy, %):

Classifier    SBASE   SMART   NCBI    Scan Prosite   Phylodome
J48           98.59   71.83   95.07   96.47          97.18
Bayes Net     98.59   71.83   95.07   96.47          97.18
Naive Bayes   98.59   71.83   95.07   96.47          97.18
Bagging       98.59   71.83   95.07   96.47          97.18

The effect of combining software on the accuracy of the model is shown in Table 2 below.
Table 2: Comparative analysis of software accuracy by machine learning algorithms (accuracy, %).

Classifier    SB+PHY  SB+SC   SB+SM   SM+NC   SC+SM   SB+NC   SB+SC   PHY+SM  PHY+NC  SC+NC
J48           90.85   90.85   92.07   37.19   39.02   90.85   90.85   31.70   33.53   38.41
Naive Bayes   90.24   87.80   90.24   37.19   37.80   90.24   87.80   32.92   33.53   38.41
Bayes Net     80.48   78.04   79.26   37.19   37.80   79.26   78.04   32.31   33.53   38.41
Bagging       90.85   90.85   90.85   36.58   39.02   90.85   90.85   30.48   33.53   38.41

Where SB+PHY = SBASE + Phylodome, SB+SC = SBASE + Scan Prosite, SB+SM = SBASE + SMART,
SM+NC = SMART + NCBI Conserved Domain, SC+SM = Scan Prosite + SMART, SB+NC = SBASE + NCBI
Conserved Domain, PHY+SM = Phylodome + SMART, PHY+NC = Phylodome + NCBI Conserved Domain,
SC+NC = Scan Prosite + NCBI Conserved Domain.
 

A confusion matrix is a visualization tool used in supervised learning (in unsupervised
learning it is called a matching matrix). Each column of the matrix represents the instances
in a predicted class, while each row represents the instances in an actual class. One
benefit of a confusion matrix is that it is easy to see whether the system is confusing two
classes, that is, commonly mislabeling one as another. The confusion matrix for the
classification of HIV-1 into the M, N and O groups is shown in Table 3.
 

Table 3: Confusion matrix generated by Naive Bayes

a     b     c     d     Classified as
27    0     1     0     a = M group
0     34    0     0     b = N group
0     0     24    0     c = O group
0     0     0     26    d = MNO group
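
As a small illustration of the row and column convention described above (rows are actual classes, columns are predicted classes), the following sketch tallies a confusion matrix from paired labels. The six actual and predicted labels are hypothetical and are not taken from the study.

```python
from collections import Counter

def confusion_matrix(actual, predicted, classes):
    """Build a matrix whose rows are actual classes and columns are predicted classes."""
    counts = Counter(zip(actual, predicted))
    return [[counts[(a, p)] for p in classes] for a in classes]

classes = ["M", "N", "O"]
# Hypothetical labels, for illustration only: one M sequence is mislabelled as O.
actual    = ["M", "M", "M", "N", "N", "O"]
predicted = ["M", "M", "O", "N", "N", "O"]

matrix = confusion_matrix(actual, predicted, classes)
for name, row in zip(classes, matrix):
    print(name, row)
```

Correct predictions fall on the diagonal; any off-diagonal count shows which actual class was mistaken for which predicted class.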



Table 4: Accuracy by class:

TP Rate   FP Rate   Precision   Recall   F-measure   Groups
0.964     0         1           0.964    0.982       M
1         0         1           1        1           N
1         0.017     0.923       1        0.96        O
1         0         1           0.967    1           MNO

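These can be calculated as:

$$\text{precision} = \frac{tp}{tp + fp}$$

$$\text{recall} = \frac{tp}{tp + fn}$$

$$\text{true negative rate} = \frac{tn}{tn + fp}$$

$$\text{accuracy} = \frac{tp + tn}{tp + tn + fp + fn}$$

$$\text{F-measure} = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$$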

These equations represent a classification task wherein precision for a class is the number
of true positives (i.e. the number of items correctly labelled as belonging to the positive
class) divided by the total number of elements labelled as belonging to the positive class
(i.e. the sum of true positives and false positives, which are items incorrectly labelled as
belonging to the class). Recall in this context is defined as the number of true positives
divided by the total number of elements that actually belong to the positive class (i.e. the
sum of true positives and false negatives, which are items which were not labelled as
belonging to the positive class but should have been) [19].
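
As a worked example of these definitions, the sketch below derives one-vs-rest precision, recall, F-measure and false positive rate for each group from the confusion matrix as printed in Table 3. The function name and dictionary keys are illustrative choices. The output is computed from Table 3 alone and is not guaranteed to match every entry of Table 4 as printed.

```python
def per_class_metrics(matrix, index):
    """One-vs-rest metrics for the class at `index`; rows are actual, columns predicted."""
    total = sum(sum(row) for row in matrix)
    tp = matrix[index][index]
    fn = sum(matrix[index]) - tp                   # actual class, predicted as another
    fp = sum(row[index] for row in matrix) - tp    # other class, predicted as this one
    tn = total - tp - fn - fp
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    fp_rate = fp / (fp + tn) if fp + tn else 0.0
    return {"precision": precision, "recall": recall,
            "f_measure": f_measure, "fp_rate": fp_rate}

# Confusion matrix as printed in Table 3 (rows: actual M, N, O, MNO).
groups = ["M", "N", "O", "MNO"]
table3 = [
    [27, 0, 1, 0],
    [0, 34, 0, 0],
    [0, 0, 24, 0],
    [0, 0, 0, 26],
]

accuracy = sum(table3[i][i] for i in range(len(groups))) / sum(map(sum, table3))
results = {g: per_class_metrics(table3, i) for i, g in enumerate(groups)}
for g, m in results.items():
    print(g, {k: round(v, 3) for k, v in m.items()})
print("accuracy", round(accuracy, 4))
```

For group M, 27 of 28 sequences are recovered and none from other groups is assigned to M, so recall is 27/28 while precision is 1.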

Precision and recall can be combined into the F-measure, which is the weighted harmonic
mean of precision and recall, or into the Matthews correlation coefficient, which is a
geometric mean of the chance-corrected variants of recall and precision. Accuracy is also
informative: it is a weighted arithmetic mean of Precision and Inverse Precision (weighted
by Bias) as well as a weighted arithmetic mean of Recall and Inverse Recall (weighted by
Prevalence) [20].
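
For reference, the standard binary-class forms of the two combined measures mentioned above are given below in terms of the tp, tn, fp and fn counts. The weighting parameter $\beta$ is an added symbol that does not appear elsewhere in this text; $\beta = 1$ gives the balanced F-measure.

$$F_\beta = \frac{(1 + \beta^2) \cdot \text{precision} \cdot \text{recall}}{\beta^2 \cdot \text{precision} + \text{recall}}$$

$$\text{MCC} = \frac{tp \cdot tn - fp \cdot fn}{\sqrt{(tp + fp)(tp + fn)(tn + fp)(tn + fn)}}$$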

With the HIV-1 M, N and O domain dataset, the results from SBASE, SMART, Scan Prosite, NCBI
and Phylodome combined with the machine learning algorithms J48, Naive Bayes, Bayes Net and
Bagging (Tables 1, 2, 4) show that SBASE predicts domains with 98.59% accuracy. This study
shows the importance of protein domains in HIV subtypes, which will aid:

- Analysis of the protein structure of HIV subtypes.
- Comparison of protein sequences across HIV-1 subtypes, which is often confined to
  particular regions of the sequence; these regions often correspond to structural domains.
- Prediction of protein function, which is based on protein domains.
- Structural classification, which is constructed using domains as building blocks.
- Study of the multiple aspects that contribute to the concept of structural domains:
  - Evolutionary aspect of HIV and its types.
  - Structural aspect of HIV and its types (compactness/independent folding of domains).
  - Functional aspect of HIV and its types (ability to carry out a function).
- Comparative analysis of HIV and related viruses in their evolutionary, structural and
  functional aspects.
 

CONCLUSION

The domain-based classification of HIV-1 groups M, N and O leads to a better understanding of HIV infection and its types. This work will help in developing novel wet-lab approaches for devising new drugs and therapeutic agents for HIV types and subtypes. The correlation of protein domains with structure explored here can be useful for obtaining better insight into these proteins. SBASE proved the most accurate at predicting protein domains in the given dataset. As more and more sequences are added to the databases, the accuracy of the model developed is expected to improve further.

Acknowledgements:
The author thanks the Department of Biotechnology, New Delhi, for providing the
Bioinformatics Infrastructure Facility at MANIT, Bhopal, for performing this study.
 
 Dr. ANUBHA DUBEY, EDUCATION DIRECTOR & TRAINER - KANISHKSOCIALMEDIA
 (PHD, BIOINFORMATICS & MBA HR)
 Phone No.: 9993210963, 9827649560
 ID: ashutoshdubey3489@gmail.com
 https://www.facebook.com/Kanishk-103603547852455/?modal=admin_todo_tour
https://ashutoshdubey3489.wixsite.com/kanishksocialmedia
