Biochemical Parameters Based Structural Classification of HIV Proteins

Abstract
Solution NMR spectroscopy, X-ray crystallography, fiber diffraction and neutron diffraction are fundamental structure determination methods that have supported the development of many scientific fields. These methods have revealed the structure and function of many biological molecules, including vitamins, drugs, proteins and nucleic acids such as DNA. X-ray crystallography is still the chief method for characterizing the atomic structure of new materials and for distinguishing materials that appear similar in other experiments. Given the importance of these structure determination methods, HIV enzyme secondary structures are classified and predicted here from biochemical and physicochemical parameters such as resolution, pH, temperature, number of non-hydrogen atoms and other parameters, using machine learning techniques. Machine learning techniques are fast and economical for classification and prediction. The accuracy obtained for each of the reported parameters is 96.5217%.

Keywords---Accuracy, Classification, Prediction, Resolution, Temperature
I. INTRODUCTION
Biomolecular structure prediction is the prediction of the three-dimensional structure of a protein from its amino acid sequence, or of a nucleic acid from its base sequence. In other words, it is the prediction of secondary and tertiary structure from primary structure. Structure prediction is the inverse of biomolecular design. Protein structure prediction is one of the most important goals pursued by bioinformatics and theoretical chemistry, and it is of high importance in medicine (for example, in drug design) and biotechnology (for example, in the design of novel enzymes) [1]. Because of this importance, HIV protein structures can be studied and determined in order to find drug targets. Several methods are currently used to determine the structure of a protein, including X-ray crystallography, NMR spectroscopy and electron microscopy. Each method has advantages and disadvantages. In each of these methods, the scientist uses many pieces of information to create the final atomic model. Primarily, the scientist has some kind of experimental data about the structure of the molecule. For X-ray crystallography, this is the X-ray diffraction pattern. For NMR spectroscopy, it is information on the local
                                                           

Manuscript received on July 29, 2011,  review  completed  on  August 17,
2011 and revised on August 25, 2011.
Anubha  Dubey,  Department  of  Bioinformatics,  MANIT,  BHOPAL
462051. E-Mail: anubhadubey@rediffmail.com
Dr.  Usha  Chauhan,  Department  of  Mathematics,  MANIT,  BHOPAL
462051.
Digital Object Identifier No. BB092011004
Because protein structure is closely linked with protein function, structural genomics has the potential to inform knowledge of protein function. In addition to elucidating protein functions, structural genomics can be used to identify novel protein folds and potential targets for drug discovery. Structural genomics involves taking a large number of approaches to structure determination, including experimental methods using genomic sequences, modelling-based approaches based on sequence or structural homology to a protein of known structure, and approaches based on chemical and physical principles for a protein with no homology to any known structure. Experimental methods of protein structure determination require proteins that express and/or crystallize well, which may inherently bias the kinds of protein folds that these experimental data elucidate. Data mining techniques offer a fast way to find a set of diagnostic attributes for studying protein structure by classification. Classification is a data mining function that assigns items in a collection to target categories or classes. The goal of classification is to accurately predict the target class for each case in the data. Here the classes are the experimental methods usually used to determine protein structure, namely X-ray crystallography (X Diff.), solution NMR (S.NMR), fiber diffraction (Fiber D.) and neutron diffraction (NDiff.), and they are predicted from properties such as pH, temperature, angles, resolution, RMS deviation, number of perfect reflections and number of non-hydrogen atoms, with an accuracy of 96.5217%. These results indicate that all the parameters are effective for classifying and predicting HIV protein structures. The reported accuracy measures include true positive rate, false positive rate, precision, recall, F-measure and ROC area. In the context of classification tasks, the terms true positives, true negatives, false positives and false negatives (see also Type I and type II errors) are used to compare the given classification of an item (the class label assigned to the item by a classifier) with the
CiiT International Journal of Biometrics and Bioinformatics, Vol 3, No 9, September 2011  417
0974-9675/CIIT–IJ-2217/05/$20/$100 © 2011 CiiT     Published by the Coimbatore Institute of Information Technology
desired  correct  classification  (the  class  the  item  actually
belongs to).
Precision and recall are then defined as (1,2):

$$\mathrm{Recall} = \frac{tp}{tp + fn} \tag{1}$$

$$\mathrm{Precision} = \frac{tp}{tp + fp} \tag{2}$$
Recall in this context is also referred to as the True Positive Rate. Other related measures used in classification include Accuracy (3) and the True Negative Rate (4), which is also called Specificity.
$$\mathrm{Accuracy} = \frac{tp + tn}{tp + tn + fp + fn} \tag{3}$$

$$\mathrm{True\ Negative\ Rate} = \frac{tn}{tn + fp} \tag{4}$$
A  measure  that  combines  precision  and  recall  is  the
harmonic  mean  of  precision  and  recall,  the  traditional  F-
measure or balanced F-score (5):
$$F = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{5}$$
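As an illustration of Equations 1 to 5, the following minimal Python sketch computes the five measures from the four counts. The function name `classification_metrics` is an assumption made for this example and is not part of WEKA. The counts in the usage lines are read off the confusion matrices reported in the Results for the X-ray diffraction class (333 X-ray structures classified as X-ray, 12 structures of other classes also classified as X-ray).

```python
def classification_metrics(tp, fp, tn, fn):
    """Return the measures of Equations 1-5 for a single class.

    A ratio whose denominator is zero is reported as 0.0 (a convention
    chosen for this sketch).
    """
    recall = tp / (tp + fn) if (tp + fn) else 0.0               # Equation 1
    precision = tp / (tp + fp) if (tp + fp) else 0.0            # Equation 2
    accuracy = (tp + tn) / (tp + tn + fp + fn)                  # Equation 3
    true_negative_rate = tn / (tn + fp) if (tn + fp) else 0.0   # Equation 4
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0  # Equation 5
    return {
        "recall": recall,
        "precision": precision,
        "accuracy": accuracy,
        "true_negative_rate": true_negative_rate,
        "f_measure": f_measure,
    }


# Counts for the X-ray diffraction class, taken from the reported confusion matrix.
metrics = classification_metrics(tp=333, fp=12, tn=0, fn=0)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these counts the sketch gives a precision of 0.965, a recall of 1 and an F-measure of 0.982, which agree with the X-ray diffraction row of the detailed accuracy tables.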
 II. METHODS
A major focus of machine learning research is to automatically learn to recognize complex patterns and make intelligent decisions based on data. The difficulty lies in the fact that the set of all possible behaviors given all possible inputs is too large to be covered by the set of observed examples (the training data). Hence the learner must generalize from the given examples so as to be able to produce a useful output in new cases. The proteins used for this study were collected from the Protein Data Bank. The algorithms were used for prediction and classification of HIV proteins on the basis of pH, temperature, angles, resolution, reflections, refinement, RMS deviation and number of non-hydrogen atoms used in refinement. The algorithms used are taken from WEKA, a machine learning software package [8].
J48: J48 is the WEKA implementation of the C4.5 decision tree learner. A decision tree is a flowchart-like tree structure, where each internal node (non-leaf node) denotes a test on an attribute, each branch represents an outcome of the test, and each leaf node (or terminal node) holds a class label. The topmost node in a tree is the root node. Internal nodes are denoted by rectangles, and leaf nodes are denoted by ovals. The construction of decision tree classifiers does not require any domain knowledge or parameter setting, and is therefore appropriate for exploratory knowledge discovery [8].
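The tree structure described above can be sketched in a few lines of Python. This is only an illustration of how a record is routed from the root to a leaf. It is not the J48 learning algorithm, and the attribute thresholds and the shape of the tree are hypothetical values chosen for the example, not results from this study.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Leaf:
    label: str  # class label held by a terminal node


@dataclass
class Node:
    attribute: str                 # attribute tested at this internal node
    threshold: float               # test is: record[attribute] <= threshold
    below: Union["Node", Leaf]     # branch taken when the test is true
    above: Union["Node", Leaf]     # branch taken when the test is false


def classify(tree, record):
    """Follow the tests from the root down to a leaf and return its label."""
    node = tree
    while isinstance(node, Node):
        if record[node.attribute] <= node.threshold:
            node = node.below
        else:
            node = node.above
    return node.label


# Hypothetical tree: thresholds are invented for illustration only.
tree = Node(
    attribute="resolution",
    threshold=3.5,
    below=Leaf("X-Ray Diff."),
    above=Node(
        attribute="temperature",
        threshold=290.0,
        below=Leaf("Fiber Diff."),
        above=Leaf("S.NMR"),
    ),
)

print(classify(tree, {"resolution": 2.1, "temperature": 100.0}))
print(classify(tree, {"resolution": 4.0, "temperature": 298.0}))
```

The root node here tests resolution, and the record only reaches the temperature test when the first test fails, which mirrors the root, internal node and leaf roles described in the text.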
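Bagging: Bagging, also called bootstrap aggregating, is a technique that repeatedly samples (with replacement) from a data set according to a uniform probability distribution. Each bootstrap sample has the same size as the original data. A base classifier is trained on each sample, and the predictions of the classifiers are combined by voting [8].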
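A minimal sketch of bootstrap aggregating is given below, using only the Python standard library. The base learner `train_majority` is a deliberately trivial stand-in chosen for this example (it is not the base classifier used by WEKA), and the toy data set carries no features; it only reuses the class counts that appear in the confusion matrices of the Results.

```python
import random
from collections import Counter


def bootstrap_sample(data, rng):
    """Draw len(data) items uniformly at random, with replacement."""
    return [rng.choice(data) for _ in data]


def train_majority(sample):
    """Toy base learner (an assumption): always predict the most common label."""
    label = Counter(lbl for _, lbl in sample).most_common(1)[0][0]
    return lambda features: label


def bagging(data, train, n_models=10, seed=0):
    """Train one model per bootstrap sample and combine them by majority vote."""
    rng = random.Random(seed)
    models = [train(bootstrap_sample(data, rng)) for _ in range(n_models)]

    def predict(features):
        votes = Counter(model(features) for model in models)
        return votes.most_common(1)[0][0]

    return predict


# Toy data: (features, label) pairs with the class counts from the Results.
data = (
    [(None, "X-Ray Diff.")] * 333
    + [(None, "S.NMR")] * 9
    + [(None, "N-Diff.")]
    + [(None, "Fiber Diff.")]
    + [(None, "x-Ray Diff.")]
)

predict = bagging(data, train_majority)
print(predict(None))
```

Any function that maps a sample to a predictor can be passed as `train`, so a decision tree learner could replace the stand-in without changing the sampling or voting steps.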
III. RESULT
The results of the algorithms are reported below for the parameters used in structure determination. Table 1 lists the parameters for which the Bagging algorithm gave the better results.
TABLE 1
PARAMETERS THAT PREDICT BETTER RESULTS FOR BAGGING.
S No.  Parameter                      Accuracy   Time taken  Average
1.     Angles                         96.5217%   0 sec       0.424
2.     Resolution                     96.5217%   0 sec       0.455
3.     Number of non-hydrogen atoms   96.5217%   0.02 sec    0.455
The detailed accuracy, confusion matrix and ROC of the above parameters are as follows:
1) Angles showing accuracies for prediction and classification
Detailed Accuracy by Class
TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area  Class
1        1        0.965      1       0.982      0.517     XDiff.
0        0        0          0       0          0.524     S.NMR
0        0        0          0       0          0.049     N-Diff
0        0        0          0       0          0.049     Fiber D.
0.965    0.965    0.932      0.965   0.948      0.513     Weighted Avg.

Confusion Matrix
   a   b   c   d   e   <-- classified as
 333   0   0   0   0 |   a = X-Ray Diff.
   9   0   0   0   0 |   b = S.NMR
   1   0   0   0   0 |   c = N-Diff.
   1   0   0   0   0 |   d = Fiber Diff.
   1   0   0   0   0 |   e = x-Ray Diff.


Fig 1. ROC of Angles is shown above.
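The weighted averages in the detailed accuracy table can be reproduced from the confusion matrix. The sketch below is an illustration under stated assumptions: the helper names `per_class` and `weighted_average` are chosen for this example, the weights are taken to be each class's share of the instances, and a ratio with a zero denominator is reported as 0. The matrix and labels are copied from the confusion matrix for Angles above.

```python
labels = ["X-Ray Diff.", "S.NMR", "N-Diff.", "Fiber Diff.", "x-Ray Diff."]

# Rows are actual classes, columns are predicted classes.
matrix = [
    [333, 0, 0, 0, 0],
    [9, 0, 0, 0, 0],
    [1, 0, 0, 0, 0],
    [1, 0, 0, 0, 0],
    [1, 0, 0, 0, 0],
]


def per_class(matrix, k):
    """Precision, recall and F-measure for class k of a confusion matrix."""
    tp = matrix[k][k]
    fn = sum(matrix[k]) - tp                    # rest of the row
    fp = sum(row[k] for row in matrix) - tp     # rest of the column
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_measure


def weighted_average(matrix):
    """Average the per-class measures, weighting each class by its size."""
    total = sum(sum(row) for row in matrix)
    sums = [0.0, 0.0, 0.0]
    for k, row in enumerate(matrix):
        weight = sum(row) / total
        for i, value in enumerate(per_class(matrix, k)):
            sums[i] += weight * value
    return tuple(sums)


total = sum(sum(row) for row in matrix)
accuracy = sum(matrix[k][k] for k in range(len(matrix))) / total

print("accuracy: %.4f%%" % (100 * accuracy))
print("weighted precision, recall, F-measure: %.3f %.3f %.3f" % weighted_average(matrix))
```

Under these assumptions the sketch gives 96.5217% accuracy (333 of 345 instances) and weighted values of 0.932, 0.965 and 0.948, matching the Weighted Avg. row.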
2) Resolution showing accuracies for prediction and
classification
Detailed Accuracy By Class
TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area  Class
1        1        0.965      1       0.982      0.517     X-RayD.
0        0        0          0       0          0.524     S.NMR
0        0        0          0       0          0.049     N.Diff.
0        0        0          0       0          0.049     Fiber Diff.
0.965    0.965    0.932      0.965   0.948      0.513     Weighted Avg.

=== Confusion Matrix ===
   a   b   c   d   e   <-- classified as
 333   0   0   0   0 |   a = X-Ray D.
   9   0   0   0   0 |   b = S.NMR
   1   0   0   0   0 |   c = N-Diff.
   1   0   0   0   0 |   d = Fiber Diff.


Fig 2 ROC of Resolution is shown above
3) Number of non hydrogen atoms based prediction and
classification.
Detailed Accuracy By Class
TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area  Class
1        1        0.965      1       0.982      0.608     X-RayD.
0        0        0          0       0          0.659     S.NMR
0        0        0          0       0          0.049     N-Diff.
0        0        0          0       0          0.049     FiberD.
0.965    0.965    0.932      0.965   0.948      0.604     Weighted Avg.

=== Confusion Matrix ===
   a   b   c   d   e   <-- classified as
 333   0   0   0   0 |   a = X-Ray Diff.
   9   0   0   0   0 |   b = S.NMR
   1   0   0   0   0 |   c = N-Diff.
   1   0   0   0   0 |   d = Fiber Diff.
   1   0   0   0   0 |   e = x-Ray Diffraction


Fig 3 ROC of Non Hydrogen Atoms is shown above
  TABLE 2
PARAMETERS FOR PREDICTION AND CLASSIFICATION ON THE BASIS O
A confusion matrix displays the number of correct and incorrect predictions made by the model compared with the actual classifications in the test data. For each parameter, the correctly classified instances are the 333 X-ray diffraction structures on the diagonal of the confusion matrix. The area under the ROC curve specifies the probability that, when one positive and one negative example are drawn at random, the decision function assigns a higher value to the positive than to the negative. The ROC areas for pH, temperature, resolution, number of non-hydrogen atoms, angles, number of perfect reflections and number of RMS deviations are 0.427, 0.427, 0.517, 0.608, 0.517, 0.427 and 0.427 respectively, with a true positive rate of 1 for each parameter. Hence, in this simulation, the Bagging and J48 machine learning algorithms proved better at predicting and classifying protein structure on the basis of the parameters used by structure determination methods.
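The probabilistic reading of the area under the ROC curve given above can be written directly as a count over positive and negative pairs. The following sketch is an illustration only: the function name `auc_pairwise` and the scores are hypothetical, ties are counted as one half (a common convention assumed here), and it is not the routine WEKA uses to report the ROC area.

```python
def auc_pairwise(pos_scores, neg_scores):
    """Fraction of (positive, negative) pairs in which the positive example
    receives the higher score; a tie counts as half a pair."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Hypothetical decision-function values, invented for illustration.
pos = [0.9, 0.8, 0.4]
neg = [0.7, 0.3]
print(auc_pairwise(pos, neg))

# A decision function that gives every example the same score cannot
# rank positives above negatives, so the area is 0.5.
print(auc_pairwise([0.5, 0.5], [0.5, 0.5, 0.5]))
```

In the first case five of the six pairs are ranked correctly, so the area is 5/6. An area near 0.5 corresponds to a ranking that is no better than chance.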
V. CONCLUSION
Owing to the high efficiency of data mining and machine learning techniques, structural classification of HIV proteins can be done with fair accuracy. Such approaches can provide new insights for the structural classification of HIV proteins, which can help in finding drug targets and in protein engineering. They will also be helpful in developing databases, and any newly engineered or newly discovered protein can then be classified using the models developed.
ACKNOWLEDGEMENT
The authors are highly thankful to the Department of Biotechnology, New Delhi, for providing the Bioinformatics Infrastructure Facility at MANIT, Bhopal, for carrying out this work.
REFERENCES
[1] http://en.wikipedia.org/wiki/Secondary_structure#Secondary_structure
[2] http://www.pdb.org
[3] Olson, David L.; Delen, Dursun, "Advanced Data Mining Techniques", Springer, 1st edition (February 1, 2008), page 138, ISBN 3540769161
[4] Philip E. Bourne, Helge Weissig, "Structural Bioinformatics", San Diego Supercomputer Center, University of California


DR ANUBHA DUBEY INDEPENDENT RESEARCH INDIA & DIRECTOR kanishkbioscience E-learning Platform
PHONE NO,9993210963


