Abstract
Solution NMR spectroscopy, X-ray crystallography, fiber diffraction and neutron diffraction are fundamental structure determination methods that have supported the development of many scientific fields. These methods have revealed the structure and function of many biological molecules, including vitamins, drugs, proteins and nucleic acids such as DNA. X-ray crystallography remains the chief method for characterizing the atomic structure of new materials and for distinguishing materials that appear similar in other experiments. Given the importance of these structure determination methods, HIV enzyme structures are classified and predicted here by machine learning techniques on the basis of biochemical and physicochemical parameters such as resolution, pH, temperature, number of non-hydrogen atoms and others. Machine learning techniques are fast and economical and are well suited to classification and prediction. The classification accuracy obtained for each of the reported parameters is 96.5217%.
Keywords---Accuracy, Classification, Prediction, Resolution, Temperature
I. INTRODUCTION
Biomolecular structure prediction is the prediction of the
three-dimensional structure of a protein from its amino acid
sequence, or of a nucleic acid from its base sequence. In
other words, it is the prediction of secondary and tertiary
structure from primary structure. Structure prediction is the
inverse of biomolecular design. Protein structure prediction
is one of the most important goals pursued by bioinformatics
and theoretical chemistry, and it is of high importance in
medicine (for example, in drug design) and biotechnology (for
example, in the design of novel enzymes) [1]. For the same
reason, HIV protein structures are studied and determined in
order to identify drug targets. Several methods are currently
used to determine the structure of a protein, including X-ray
crystallography, NMR spectroscopy and electron microscopy.
Each method has advantages and disadvantages. In each of
these methods, the scientist uses many pieces of information
to create the final atomic model. Primarily, the scientist has
some kind of experimental data about the structure of the
molecule. For X-ray crystallography, this is the X-ray
diffraction pattern. For
NMR spectroscopy, it is information on the local
Manuscript received on July 29, 2011, review completed on August 17,
2011 and revised on August 25, 2011.
Anubha Dubey, Department of Bioinformatics, MANIT, BHOPAL
462051. E-Mail: anubhadubey@rediffmail.com
Dr. Usha Chauhan, Department of Mathematics, MANIT, BHOPAL
462051.
Digital Object Identifier No. BB092011004
Because protein structure is closely linked with protein
function, structural genomics has the potential to inform
knowledge of protein function. In addition to elucidating
protein functions, structural genomics can be used to identify
novel protein folds and potential targets for drug discovery.
Structural genomics involves a large number of approaches to
structure determination, including experimental methods using
genomic sequences, modelling-based approaches based on
sequence or structural homology to a protein of known
structure, and approaches based on chemical and physical
principles for a protein with no homology to any known
structure. Experimental methods of protein structure
determination require proteins that express and/or crystallize
well, which may inherently bias the kinds of protein folds
that these experimental data elucidate. Data mining techniques
offer a fast way to find an accurate set of diagnostic
attributes for studying protein structure by classification.
Classification is a data mining function that assigns items in
a collection to target categories or classes. The goal of
classification is to accurately predict the target class for
each case in the data. Here the classes are the experimental
methods used to determine protein structure, namely X-ray
crystallography (X Diff.), solution NMR (S.NMR), fiber
diffraction (Fiber D.) and neutron diffraction (N Diff.), and
the attributes are properties such as pH, temperature, angles,
resolution, number of RMS deviations, number of perfect
reflections and number of non-hydrogen atoms. Classification
on these properties reaches an accuracy of 96.5217%, which
suggests that the parameters are effective for classifying and
predicting HIV protein structures. The reported performance
measures include true positive rate, false positive rate,
precision, recall, F-measure and ROC area. In the context of
classification tasks, the terms true positives, true negatives,
false positives and false negatives (see also Type I and Type II
errors) are used to compare the given classification of an item
(the class label assigned to the item by a classifier) with the
CiiT International Journal of Biometrics and Bioinformatics, Vol 3, No 9, September 2011 417
0974-9675/CIIT–IJ-2217/05/$20/$100 © 2011 CiiT Published by the Coimbatore Institute of Information Technology
desired correct classification (the class the item actually
belongs to).
Precision and recall are then defined as (1, 2):
\[ \mathrm{Recall} = \frac{tp}{tp + fn} \tag{1} \]
\[ \mathrm{Precision} = \frac{tp}{tp + fp} \tag{2} \]
Recall in this context is also referred to as the True Positive
Rate. Other related measures used in classification include
Accuracy (3) and the True Negative Rate (4), which is also
called Specificity.
\[ \mathrm{Accuracy} = \frac{tp + tn}{tp + tn + fp + fn} \tag{3} \]
\[ \mathrm{True\ Negative\ Rate} = \frac{tn}{tn + fp} \tag{4} \]
A measure that combines precision and recall is their
harmonic mean, the traditional F-measure or balanced
F-score (5):
\[ F = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{5} \]
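As a minimal illustration of Equations 1 to 5, the measures can be computed directly from the four counts. The function names and the example counts below are assumptions chosen for illustration and are not taken from the study; ratios with a zero denominator are returned as 0 by convention.

```python
def safe_div(numerator, denominator):
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def recall(tp, fn):
    """Equation 1: true positive rate."""
    return safe_div(tp, tp + fn)


def precision(tp, fp):
    """Equation 2."""
    return safe_div(tp, tp + fp)


def accuracy(tp, tn, fp, fn):
    """Equation 3."""
    return safe_div(tp + tn, tp + tn + fp + fn)


def true_negative_rate(tn, fp):
    """Equation 4: specificity."""
    return safe_div(tn, tn + fp)


def f_measure(p, r):
    """Equation 5: harmonic mean of precision p and recall r."""
    return safe_div(2 * p * r, p + r)


# Hypothetical counts, for illustration only.
tp, fp, fn, tn = 40, 10, 5, 45
p = precision(tp, fp)
r = recall(tp, fn)
print(f"precision={p:.3f} recall={r:.3f} "
      f"accuracy={accuracy(tp, tn, fp, fn):.3f} "
      f"specificity={true_negative_rate(tn, fp):.3f} "
      f"F={f_measure(p, r):.3f}")
```

Returning 0 for an undefined ratio matches the way the per-class tables in Section III report 0 for classes that receive no predictions.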
II. METHODS
A major focus of machine learning research is to
automatically learn to recognize complex patterns and make
intelligent decisions based on data. The difficulty lies in the
fact that the set of all possible behaviors given all possible
inputs is too large to be covered by the set of observed
examples (training data). Hence the learner must generalize
from the given examples so as to produce a useful output in
new cases. The proteins used for this study were collected
from the Protein Data Bank. HIV proteins were classified and
predicted on the basis of pH, temperature, angles, resolution,
reflections, refinement, number of RMS deviations and number
of non-hydrogen atoms used in refinement. The algorithms used
are taken from WEKA, a machine learning software package [8].
J48: J48 is the decision tree learner in WEKA. A decision tree
is a flowchart-like tree structure, where each internal node
(non-leaf node) denotes a test on an attribute, each branch
represents an outcome of the test, and each leaf node (or
terminal node) holds a class label. The topmost node in a tree
is the root node. In diagrams, internal nodes are drawn as
rectangles and leaf nodes as ovals. The construction of
decision tree classifiers does not require any domain
knowledge or parameter setting, and is therefore appropriate
for exploratory knowledge discovery [8].
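The tree structure described above can be sketched as a small data structure with a traversal routine. This is an illustration only: the attribute names, thresholds and leaf labels are hypothetical and are not the tree learned by J48 in this study.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Leaf:
    label: str  # class label held by a terminal node


@dataclass
class Node:
    attribute: str            # attribute tested at this internal node
    threshold: float          # hypothetical split value
    low: Union["Leaf", "Node"]   # branch taken when value <= threshold
    high: Union["Leaf", "Node"]  # branch taken when value > threshold


def classify(tree, record):
    """Walk from the root to a leaf, following the outcome of each test."""
    while isinstance(tree, Node):
        if record[tree.attribute] <= tree.threshold:
            tree = tree.low
        else:
            tree = tree.high
    return tree.label


# Hypothetical tree; the thresholds are invented for illustration.
toy_tree = Node(
    attribute="resolution",
    threshold=3.5,
    low=Leaf("X-ray diffraction"),
    high=Node(
        attribute="temperature",
        threshold=290.0,
        low=Leaf("Solution NMR"),
        high=Leaf("Fiber diffraction"),
    ),
)

print(classify(toy_tree, {"resolution": 2.0, "temperature": 100.0}))
```

Each call to classify performs one attribute test per level, so the cost of a prediction is bounded by the depth of the tree.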
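Bagging: Bagging, also called bootstrap aggregating, is a
technique that repeatedly samples, with replacement, from a
data set according to a uniform probability distribution. Each
bootstrap sample has the same size as the original data. A
base classifier is trained on each sample and the predictions
of the classifiers are combined by voting [8].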
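The sampling and voting steps of bagging can be sketched as follows. The base learner here is a deliberately trivial one that predicts the most frequent label in its training sample; it is an assumption made to keep the example short and is not the base learner used in WEKA or in this study. The class counts mirror those reported in Section III.

```python
import random
from collections import Counter


def bootstrap_sample(data, rng):
    """Draw len(data) items uniformly at random, with replacement."""
    n = len(data)
    return [data[rng.randrange(n)] for _ in range(n)]


def majority_label(sample):
    """Toy base learner: the most frequent label in the sample."""
    return Counter(label for _, label in sample).most_common(1)[0][0]


def bagging_predict(data, n_models=10, seed=0):
    """Train one toy learner per bootstrap sample and combine by majority vote."""
    rng = random.Random(seed)
    votes = [majority_label(bootstrap_sample(data, rng)) for _ in range(n_models)]
    return Counter(votes).most_common(1)[0][0]


# (features, label) pairs; features are left empty because the toy learner ignores them.
data = (
    [({}, "X-ray diff.")] * 333
    + [({}, "S.NMR")] * 9
    + [({}, "N-diff.")]
    + [({}, "Fiber diff.")]
    + [({}, "x-Ray diff. (e)")]
)

print(bagging_predict(data))
```

With a real base learner such as a decision tree, each model would be fitted to the features of its own bootstrap sample, and the vote would be taken per test instance.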
III. RESULT
The results of the algorithms for the structure determination
parameters are given below. Table 1 lists the parameters for
which the Bagging algorithm gave the better results.
TABLE 1
PARAMETERS FOR WHICH BAGGING PREDICTS BETTER RESULTS.
S No.  Parameter                     Accuracy   Time taken  Average
1.     Angles                        96.5217%   0 sec       0.424
2.     Resolution                    96.5217%   0 sec       0.455
3.     Number of non-hydrogen atoms  96.5217%   0.02 sec    0.455
The detailed accuracy, confusion matrix and ROC for the above
parameters are as follows:
1) Angles: accuracies for prediction and classification
Detailed Accuracy by Class
TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area  Class
1        1        0.965      1       0.982      0.517     XDiff.
0        0        0          0       0          0.524     S.NMR
0        0        0          0       0          0.049     N-Diff
0        0        0          0       0          0.049     Fiber D.
0.965    0.965    0.932      0.965   0.948      0.513     Weighted Avg.
Confusion Matrix
  a   b   c   d   e   <-- classified as
333   0   0   0   0 | a = X-Ray Diff.
  9   0   0   0   0 | b = S.NMR
  1   0   0   0   0 | c = N-Diff.
  1   0   0   0   0 | d = Fiber Diff.
  1   0   0   0   0 | e = x-Ray Diff.
Fig 1. ROC of Angles is shown above.
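The accuracy and the weighted averages in the table above can be recomputed from the confusion matrix. The sketch below assumes the WEKA convention that rows are actual classes and columns are predicted classes, treats undefined ratios (0/0) as 0, and weights each class by its number of instances; the helper names are illustrative.

```python
# Confusion matrix as printed above (rows: actual class, columns: predicted class).
matrix = [
    [333, 0, 0, 0, 0],  # a = X-Ray Diff.
    [9, 0, 0, 0, 0],    # b = S.NMR
    [1, 0, 0, 0, 0],    # c = N-Diff.
    [1, 0, 0, 0, 0],    # d = Fiber Diff.
    [1, 0, 0, 0, 0],    # e = x-Ray Diff.
]


def ratio(a, b):
    """a / b, with 0.0 returned when b is zero."""
    return a / b if b else 0.0


def one_vs_rest(matrix, k):
    """Return (tp, fp, fn, tn) for class k against all other classes."""
    total = sum(sum(row) for row in matrix)
    tp = matrix[k][k]
    fn = sum(matrix[k]) - tp
    fp = sum(row[k] for row in matrix) - tp
    tn = total - tp - fn - fp
    return tp, fp, fn, tn


def weighted_summary(matrix):
    """Return accuracy and support-weighted TP rate, FP rate and precision."""
    total = sum(sum(row) for row in matrix)
    acc = sum(matrix[k][k] for k in range(len(matrix))) / total
    w_tpr = w_fpr = w_prec = 0.0
    for k in range(len(matrix)):
        tp, fp, fn, tn = one_vs_rest(matrix, k)
        weight = (tp + fn) / total  # share of instances that belong to class k
        w_tpr += weight * ratio(tp, tp + fn)
        w_fpr += weight * ratio(fp, fp + tn)
        w_prec += weight * ratio(tp, tp + fp)
    return acc, w_tpr, w_fpr, w_prec


acc, w_tpr, w_fpr, w_prec = weighted_summary(matrix)
# Agrees with the reported 96.5217% accuracy and the weighted averages
# 0.965 (TP rate), 0.965 (FP rate) and 0.932 (precision).
print(f"{acc:.6f} {w_tpr:.3f} {w_fpr:.3f} {w_prec:.3f}")
```

For the X-ray diffraction class the one-vs-rest counts are tp = 333, fp = 12, fn = 0 and tn = 0, which gives the TP rate of 1 and FP rate of 1 shown in the first row of the table.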
2) Resolution: accuracies for prediction and classification
Detailed Accuracy by Class
TP Rate  FP Rate  Precision  Recall  F-Measure  ROC       Class
1        1        0.965      1       0.982      0.517     X-RayD.
0        0        0          0       0          0.524     S.NMR
0        0        0          0       0          0.049     N.Diff.
0        0        0          0       0          0.049     Fiber Diff.
0.965    0.965    0.932      0.965   0.948      0.513     Weighted Avg.
Confusion Matrix
  a   b   c   d   e   <-- classified as
333   0   0   0   0 | a = X-Ray D.
  9   0   0   0   0 | b = S.NMR
  1   0   0   0   0 | c = N-Diff.
  1   0   0   0   0 | d = Fiber Diff.
Fig 2 ROC of Resolution is shown above
3) Number of non-hydrogen atoms: accuracies for prediction and
classification
TP Rate  FP Rate  Precision  Recall  F-Measure  ROC       Class
1        1        0.965      1       0.982      0.608     X-RayD.
0        0        0          0       0          0.659     S.NMR
0        0        0          0       0          0.049     N-Diff.
0        0        0          0       0          0.049     FiberD.
0.965    0.965    0.932      0.965   0.948      0.604     Weighted Avg.
Confusion Matrix
  a   b   c   d   e   <-- classified as
333   0   0   0   0 | a = X-Ray Diff.
  9   0   0   0   0 | b = S.NMR
  1   0   0   0   0 | c = N-Diff.
  1   0   0   0   0 | d = Fiber Diff.
  1   0   0   0   0 | e = x-Ray Diffraction
Fig 3 ROC of Non Hydrogen Atoms is shown above
TABLE 2
PARAMETERS FOR PREDICTION AND CLASSIFICATION ON THE BASIS O
A confusion matrix displays the number of correct and
incorrect predictions made by the model compared with the
actual classifications in the test data. For each parameter the
confusion matrix shows that all 333 X-ray diffraction
structures are classified correctly. The area under the ROC
curve is the probability that, when one positive and one
negative example are drawn at random, the decision function
assigns a higher value to the positive than to the negative.
The ROC areas for pH, temperature, resolution, number of
non-hydrogen atoms, angles, number of perfect reflections and
number of RMS deviations are 0.427, 0.427, 0.517, 0.608,
0.517, 0.427 and 0.427 respectively, with a true positive rate
of 1 for each parameter. Hence, by simulation, the Bagging and
J48 machine learning algorithms proved effective in predicting
and classifying protein structures on the basis of the
parameters used by structure determination methods.
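The probabilistic reading of the area under the ROC curve given above can be checked by direct counting over all positive-negative pairs. In this sketch the scores are hypothetical, and ties are counted as one half, which is a common convention and an assumption here.

```python
def auc_pairwise(pos_scores, neg_scores):
    """Fraction of (positive, negative) pairs in which the positive scores higher."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # tie: counted as half a win
    return wins / (len(pos_scores) * len(neg_scores))


# Hypothetical decision-function scores, for illustration only.
pos = [0.9, 0.8, 0.4]
neg = [0.7, 0.3]
print(auc_pairwise(pos, neg))  # 5 of the 6 pairs are ranked correctly
```

A classifier that assigns the same score to every instance ties on every pair and therefore has an area of 0.5 under this definition.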
V. CONCLUSION
Owing to the high efficiency of data mining and machine
learning techniques, structural classification of HIV proteins
can be done with fair accuracy. Such approaches can provide
new insights into the structural classification of HIV
proteins for finding drug targets and for protein engineering.
They will also be helpful in developing databases. Any newly
engineered or newly discovered protein can further be
classified using the models developed.
ACKNOWLEDGEMENT
The authors are highly thankful to the Department of
Biotechnology, New Delhi, for providing the Bioinformatics
Infrastructure Facility at MANIT, Bhopal, for carrying out
this work.
REFERENCES
[1] http://en.wikipedia.org/wiki/Secondary_structure#Secondary_structure
[2] http://www.pdb.org
[3] Olson, David L.; Delen, Dursun, "Advanced Data Mining Techniques",
Springer, 1st edition (February 1, 2008), page 138, ISBN 3540769161
[4] Bourne, Philip E.; Weissig, Helge, "Structural Bioinformatics", San Diego
Supercomputer Center, University of California