SVM Model for Amino Acid Composition Based Classification of HIV-1 Groups

Abstract
Human immunodeficiency virus (HIV) causes acquired immunodeficiency syndrome (AIDS), which leads to life-threatening opportunistic infections. HIV-1 has three groups known worldwide: M, N, and O. Group M is the most widely distributed; it has nine subtypes, and circulating recombinant forms also arise through rapid recombination and mutation. These variants play an important role in diagnosing the correct group of HIV-1. There is therefore a need to understand the relationships among various parameters of the proteins of HIV-1 groups M, N, and O in order to predict their classes, structures, and functions. Overlapping patterns among the three groups lead to uncertainty in group prediction and thus pose challenges for the development of computational approaches that predict classes with fair accuracy. Computational approaches for class prediction are fast and economical and can therefore complement existing wet lab techniques. In this paper, an attempt has been made to correlate the groups with their amino acid composition and to predict them with fair accuracy. The SVM was implemented using the LIBSVM package. The method discriminates MN, NO, and MO from MNO using amino acid composition. Performance was evaluated using 10-fold cross-validation: an accuracy of 99.93% was obtained for MNO, and accuracies of 88.64%, 89.02%, and 96.11% were obtained for MN, NO, and MO, respectively.
Keywords- HIV groups; Support vector machine; Amino acid composition; Kernel function.
I. INTRODUCTION
HIV is an abbreviation for Human Immunodeficiency Virus. Based on data from Sentinel Surveillance and the National Family Health Survey III of 2005-2006, it has been estimated that 2 to 2.1 million people were living with HIV/AIDS in India in 2006 [1]. There are two forms of HIV: HIV-1 and HIV-2. HIV-1 was discovered by Luc Montagnier and his associates at the Institut Pasteur in Paris in 1983. HIV-2 was first identified among patients in Cameroon in 1985. HIV-2 is more similar to SIV (Simian Immunodeficiency Virus) than HIV-1 is, and it is much less virulent (usually not resulting in full-blown AIDS, but still fatal). One of the obstacles to the treatment of HIV is its high genetic variability [2].
The HIV-1 variants are divided into three groups: M for Major, N for New, and O for Other or Outlier. Within the M group there are at least ten subtypes or clades: A, B, C, D, E, F, G, H, I, J, K, and U. Subtype C is the predominant form in India and Nepal, and it is subtype C that causes most infections worldwide [3]. In the September 1, 1998 issue of Nature Medicine, F. Simon announced the discovery of a variant of HIV-1 that fits neither the M nor the O grouping. It seems to fall between the M group and the simian immunodeficiency virus, SIV. This is the N group ("non-M, non-O"). As of 2006, only 10 group N infections had been identified [4]. The first discovered case occurred in a woman in Cameroon, and all tests with EIA or Western blot were negative [5]. The O ("Outlier") group is not usually seen outside West-Central Africa. It is reportedly most common in Cameroon, where a 1997 survey found that about 2% of HIV-positive samples were from group O. The group caused some concern because it could not be detected by early versions of the HIV-1 test kit. More advanced HIV tests have now been developed to detect both group O and group N [6].
HIV contains nine genes made up of 9749 base pairs. All retroviruses contain the genes gag (codes for internal structural proteins and capsid proteins, using about 2000 base pairs) and pol (codes for the three enzymes necessary for replication, using 2900 base pairs). Other genes within HIV are tat (transactivator protein), rev (regulator of expression of virus protein), vif (virus infectivity factor), nef (misnamed negative regulatory factor, but really an enhancing factor), vpr (virus protein R), and vpu (virus protein U), encoding 19 proteins [7].
Proteins (also known as polypeptides) are organic compounds made of amino acids arranged in a linear chain and folded into a globular form. The amino acids in a polymer chain are joined together by peptide bonds. The sequence of amino acids in a protein is defined by the sequence of a gene. In the present approach we develop a classifier to classify HIV-1 groups on the basis of their amino acid composition. Proteins are assembled from amino acids using information encoded in genes. Each protein has its own unique amino acid sequence, which is specified by the nucleotide sequence of the gene encoding it. The size of a synthesized protein can be measured by the number of amino acids it contains and by its total molecular mass, which is normally reported in units of daltons.
Experimental work reported in the literature indicates that there are three groups of HIV-1, according to their occurrence in different countries. However, no computational technique is available in the literature for classification of HIV-1 groups based on other parameters such as dipeptide composition, amino acid composition, and physicochemical properties. Because of overlapping patterns among the groups of HIV-1 proteins, there is a need to develop fast and accurate computational approaches for prediction and classification of sequences.
In view of the above, an attempt has been made in this paper to develop a computational approach for predicting and classifying the three groups of HIV-1. This is a binary classification method in which M, N, and O can be discriminated as MN, MO, NO, and MNO. It has been shown in the past that SVM is an elegant technique for the classification of biological data [8], [9], [10-13]. Here an SVM model has been developed for amino acid composition based prediction, identification, and classification of the different groups of HIV.
II. MATERIALS AND METHODS
Data set
To achieve our goal and develop our methodology, we obtained the data from the Swiss-Prot/UniProt databank on the ExPASy server [12]. The following four data sets were used.
Dataset1: It consisted of all the proteins of members of the M, N, and O groups, i.e. MNO. Entries marked as fragments were not included in the dataset. The total number of instances for M, N, and O was 752. Of these, 356 were positive instances belonging to MNO and 356 were negative instances belonging to the enzymatic group.
Dataset2: It consisted of all the proteins of members of the MN group. Entries marked as fragments were not included in the dataset. There were 295 instances for the M group and 18 for the N group. The final dataset consisted of 313 sequences belonging to the M and N classes of MN.
Dataset3: It consisted of all the nucleotides of members of the NO group. Entries marked as fragments were not included in the dataset. They were treated as negative instances. There were 18 instances for the N group and 42 for the O group. The final dataset consisted of both the N and O classes of NO.
Dataset4: It consisted of all the nucleotides of members of the MO group. Entries marked as fragments were not included in the dataset. They were treated as negative instances. There were 295 instances for the M group and 42 for the O group.
For the training dataset we considered sequences belonging to MNO. A support vector machine (binary classification) was used for classification.
Support vector machine (Binary classification)
SVM is a supervised machine learning method based on statistical learning theory [15,16]. When used as a binary classifier, an SVM constructs a hyperplane that acts as the decision surface between the two classes. This is achieved by maximizing the margin of separation between the hyperplane and the points nearest to it. The SVMs were implemented using the freely downloadable software LIBSVM [14]. The software allows the user to define parameters and to choose among various built-in kernels: linear, polynomial (of a given degree), radial basis function (RBF), or sigmoid.
SVM software: LIBSVM
Simulations were performed using LIBSVM version 2.89 (a freely available software package) [14]. For our study the RBF kernel was found to be the best. SVM training was carried out by optimizing the value of the regularization parameter and the value of the RBF kernel parameter.
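To make the role of the RBF kernel parameter concrete, the following is a minimal sketch of the kernel in the form used by LIBSVM, K(u, v) = exp(-gamma * |u - v|^2). The function name and the two example vectors are illustrative assumptions and are not taken from the paper's data.

```python
import math


def rbf_kernel(u, v, gamma):
    """Return exp(-gamma * ||u - v||^2) for two equal-length feature vectors."""
    if len(u) != len(v):
        raise ValueError("feature vectors must have the same length")
    squared_distance = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * squared_distance)


# Two hypothetical 3-dimensional feature vectors (illustration only).
u = [0.10, 0.05, 0.20]
v = [0.12, 0.04, 0.18]

# A larger gamma makes the similarity fall off faster with distance.
for gamma in (0.125, 0.5, 2.0):
    print(gamma, rbf_kernel(u, v, gamma))
```

The kernel equals 1 for identical vectors and decreases towards 0 as they move apart, which is why gamma and the regularization parameter are usually tuned together.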
Amino Acid Composition
Previously, this parameter has been used for predicting the subcellular localization of proteins [10]. The amino acid composition is the fraction of each amino acid type within a protein. The fractions of all 20 natural amino acids were calculated using Equation 1:
Fraction of amino acid i = (total number of amino acid i) / (total number of amino acids in the protein)   (1)
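The composition described in this subsection can be computed directly from a sequence string. The sketch below is one possible implementation, not the authors' code; the function name, the residue ordering, and the choice to ignore non-standard residue codes in the denominator are assumptions made for illustration.

```python
from collections import Counter

# The 20 standard amino acids in one-letter code (alphabetical order is an
# arbitrary choice made for this sketch).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence):
    """Return a list of 20 fractions, one per standard amino acid.

    Characters outside the 20 standard codes (for example 'X') are ignored,
    so the fractions sum to 1 over the standard residues only.
    """
    counts = Counter(ch for ch in sequence.upper() if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[aa] / total for aa in AMINO_ACIDS]


# Hypothetical short peptide used only to show the call.
composition = amino_acid_composition("MKVLAAGIVG")
for aa, fraction in zip(AMINO_ACIDS, composition):
    if fraction > 0:
        print(aa, round(fraction, 3))
```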
Polycomp
The input vector of 450 was generated directly in the SVM format by the software Polycomp, developed in the Department of Bioinformatics, MANIT, Bhopal, India [19]. This software generates data that can be fed directly into the classifier, saving the time and energy otherwise needed for formatting the hybrid.
Evaluation of Performance
The performance of our classifier was judged by 10-fold cross-validation. LIBSVM provides a parameter selection tool for the RBF kernel: cross-validation via grid search. A grid search was performed on C and Gamma using a built-in module of the LIBSVM tools on Dataset 1, Dataset 2, Dataset 3, and Dataset 4, as shown in Figure 1, Figure 2, Figure 3, and Figure 4. Pairs of C and Gamma are tried, and the one with the best cross-validation accuracy is picked. Using the values C = 2 and Gamma = 0.5 obtained through grid search, an accuracy of 99.93% was obtained on Dataset 1. The grid search gave C = 0.125 and Gamma = 2.0 on Dataset 2, C = 8.0 and Gamma = 2.0 on Dataset 3, and C = 32.0 and Gamma = 0.125 on Dataset 4.

Figure 1. Grid Search on C and Gamma on the Data set 1
2010 International Conference on Bioinformatics and Biomedical Technology
Prediction system assessment
True positives (TP) and true negatives (TN) were correctly identified positive and negative samples, respectively. False positives (FP) were negative samples identified as positive. False negatives (FN) were positive samples identified as negative. The prediction performance was tested with sensitivity (TP / (TP + FN)), specificity (TN / (TN + FP)), and overall accuracy (Q2). The accuracy for each group of HIV-1 was calculated as described by Hua and Sun [14] and shown in equation 2.
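The sensitivity and specificity formulas given in the text translate directly into code. In the sketch below, overall accuracy is taken as (TP + TN) / (TP + TN + FP + FN), which is the usual definition and is an assumption here because the paper's equation 2 is not reproduced in the text. The counts in the example are invented for illustration.

```python
def classification_metrics(tp, tn, fp, fn):
    """Return (sensitivity, specificity, accuracy) from confusion-matrix counts."""
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative sample")
    sensitivity = tp / (tp + fn)   # fraction of positives correctly identified
    specificity = tn / (tn + fp)   # fraction of negatives correctly identified
    accuracy = (tp + tn) / (tp + tn + fp + fn)  # assumed definition of Q2
    return sensitivity, specificity, accuracy


# Hypothetical counts, not results from the paper.
sens, spec, acc = classification_metrics(tp=40, tn=45, fp=5, fn=10)
print(f"sensitivity={sens:.2f} specificity={spec:.2f} accuracy={acc:.2f}")
```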

Figure 2. Grid Search on C and Gamma on the Dataset 2
Figure 3. Grid Search on C and Gamma on the Dataset 3
Figure 4. Grid Search on C and Gamma on the Dataset 4
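The grid search with 10-fold cross-validation used for parameter selection can be sketched in outline as follows. This is a simplified illustration rather than the LIBSVM tool: the exponent ranges are chosen only for the example, the fold assignment is round-robin instead of random, and `toy_cv_accuracy` is a made-up stand-in for training and validating an SVM at each (C, Gamma) pair.

```python
import itertools
import math


def power_of_two_grid(log2_start, log2_stop, log2_step):
    """Return [2**e] for e = start, start + step, ... up to and including stop."""
    values = []
    exponent = log2_start
    while exponent <= log2_stop:
        values.append(2.0 ** exponent)
        exponent += log2_step
    return values


def k_fold_indices(n_samples, k=10):
    """Assign sample indices to k folds of near-equal size (round-robin)."""
    folds = [[] for _ in range(k)]
    for i in range(n_samples):
        folds[i % k].append(i)
    return folds


def grid_search(c_values, gamma_values, cv_accuracy):
    """Try every (C, gamma) pair and return the one with the best CV accuracy."""
    best = None
    for c, gamma in itertools.product(c_values, gamma_values):
        accuracy = cv_accuracy(c, gamma)
        if best is None or accuracy > best[2]:
            best = (c, gamma, accuracy)
    return best


def toy_cv_accuracy(c, gamma):
    """Hypothetical score surface peaking at C = 2, gamma = 0.5 (not real data)."""
    return -abs(math.log2(c) - 1) - abs(math.log2(gamma) + 1)


c_grid = power_of_two_grid(-5, 15, 2)      # illustrative range for C
gamma_grid = power_of_two_grid(-15, 3, 2)  # illustrative range for gamma
best_c, best_gamma, _ = grid_search(c_grid, gamma_grid, toy_cv_accuracy)
print("best C:", best_c, "best gamma:", best_gamma)

# Each fold serves once as the validation set; the other nine form the training set.
folds = k_fold_indices(313, k=10)
print([len(fold) for fold in folds])
```

In a real run, the scoring function would train on nine folds and measure accuracy on the held-out fold, averaging over all ten folds for each parameter pair.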
III. RESULTS AND DISCUSSION
The results obtained here will be helpful in differentiating between the different groups of HIV-1. A newly discovered protein can be shown to belong to the MN, NO, or MO group of HIV-1 proteins. The C, Gamma (g), and accuracy for the HIV-1 group MNO are given in Table 1.
TABLE 1: THE C, G AND ACCURACY OF GROUPS OF HIV1.
Group   C   g     Accuracy
MNO     2   0.5   99.93%
The three groups MN, NO, and MO are equally implicated in disease progression, mainly M because it has nine subtypes. The results of classifying M, N, and O using amino acid composition are given in Table 2.
TABLE 2: THE C, G AND ACCURACY OF GROUPS OF HIV1.
S. No.   Groups   C       g       Accuracy
1        MN       0.125   2.0     88.6435%
2        MO       8.0     2.0     96.115%
3        NO       32.0    0.125   89.02%
Figure 5. ROC curve for MNO depicting FP and TP.
Our results clearly highlight the importance of amino acid composition in differentiating between these groups. This model can also be an important tool for understanding the differences between M, N, and O, and hence a step towards assisting wet lab techniques in devising novel drugs and therapeutic agents against these three groups. The correlation of M, N, and O with their amino acid composition explored here can be useful for obtaining better insight into these proteins. Their molecular and physiological roles, along with substrate affinity, can also be correlated with amino acid composition. The accuracy of predicting group MNO was found to be 99.93%. The overall accuracy of the amino acid composition-based classifier for the three groups MN, MO, and NO was 88.64%, 96.11%, and 89.02%, respectively. This shows that M, N, and O can be correlated with amino acid composition and can be distinguished on this basis.
The receiver operating characteristics (ROC) score is
usually used as the primary measure of the machine learning
method performance and provides an overview of the possible cut-off levels in the test performance. The ROC curve for group MNO is depicted in Figure 5, which shows that the majority of instances fall in the true positive range.
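As an illustration of how ROC points arise from sweeping the cut-off level, the sketch below computes (false positive rate, true positive rate) pairs from classifier scores and the area under the resulting curve by the trapezoidal rule. The function names, labels, and scores are hypothetical and are not the values behind Figure 5.

```python
def roc_points(labels, scores):
    """Return (FPR, TPR) pairs obtained by lowering the threshold step by step.

    labels: 1 for positive samples, any other value for negative samples.
    scores: larger score means more confidently positive.
    """
    positives = sum(1 for y in labels if y == 1)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative sample")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Samples with tied scores cross the threshold together.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / negatives, tp / positives))
    return points


def area_under_curve(points):
    """Trapezoidal area under a curve given as ordered (x, y) points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Hypothetical labels and decision scores.
example_labels = [1, 0, 1, 0]
example_scores = [0.9, 0.8, 0.7, 0.1]
curve = roc_points(example_labels, example_scores)
print(curve)
print("AUC:", area_under_curve(curve))
```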
IV. CONCLUSIONS
The SVM model developed here is computationally efficient and effective in predicting and classifying the HIV-1 groups M, N, and O. This is evident from the accuracy (99%) in the results. Further, the amino acid composition contains significant information for discriminating the classes of the above proteins. Prediction systems of this type can be very useful for understanding these groups better. This method will complement the existing wet lab methods. It will assist in assigning the correct class to which these proteins belong, that is, in classifying them into one of the three groups. The prediction method presented here may be useful for the annotation of piled-up proteomic data. The author awaits the discovery of more groups and subtypes of HIV-1 in the future so that the accuracy of the prediction model can be increased further and a server developed for public use.
ACKNOWLEDGMENT
The authors are highly thankful to the Department of Biotechnology, New Delhi, India and the M.P. Council of Science and Technology, Bhopal, M.P., India for providing support in the form of the Bioinformatics infrastructure facility to carry out the research work.
Anubha Dubey
Department of Bioinformatics, MANIT, Bhopal,
India
anubhadubey@rediffmail.com