Abstract
With the development of antimicrobials, microbes have adapted and become resistant
to previously effective antimicrobial agents. In response, the WHO has published a
list that ranks antimicrobials as critically important, highly important or important
for human medicine. There is therefore a need to identify critically important
antimicrobials so that their use can be reserved for humans. In this paper, a machine
learning model is developed to classify critically important antimicrobials based on
their amino acid composition, reaching an accuracy of 99.8% for this class.
Keywords: antimicrobials, WHO, machine learning, amino acid composition
Introduction
Medicine is the science and practice of the diagnosis, treatment,
and prevention of disease. Its purpose is the maintenance and
restoration of health by preventing and treating illness.
Antimicrobial medicines kill microbes or slow their growth.
Organisms such as bacteria and viruses that are not visible to the
naked eye are called micro-organisms or microbes. Some categories of
microbes are listed in Table 1.
Table 1 Variety of microbes with examples and the infections they cause
Microbe     Example                       Type of infection caused
Bacteria    Staphylococcus aureus, etc.   Some staph infections
Virus       Influenza                     Flu
Fungi       Candida albicans, etc.        Yeast infections
Parasites   Plasmodium falciparum, etc.   Malaria
Different classes of antimicrobials are used to treat human
diseases. When antimicrobials are used regularly, microbes develop
resistance to them, known as antimicrobial resistance (AMR). The genes
responsible for resistance are called antimicrobial resistance genes.
For example, the ndm-1 gene, which encodes resistance to the carbapenem
family, was first discovered in Klebsiella pneumoniae isolated from an
infected person.1 Most AMR is hazardous to human health.
The characteristics by which antimicrobials are classified are as follows:
Characteristic 1 (C1): The antimicrobial class treats serious illness
caused by bacteria in people.
Characteristic 2 (C2): The antimicrobial class is used to treat
infections caused by: (a) bacteria that may be transmitted to humans
from non-human sources, or (b) bacteria that may acquire resistance
genes from non-human sources.
Antimicrobials vs antibiotics
Antibiotics are medicines that act against bacteria and are used to
treat bacterial infections. Antibiotic resistance develops when
bacteria change in response to the repeated use of antibiotics.
Antimicrobial resistance is the broader term: it also covers
resistance to drugs used to treat infections caused by other microbes
such as parasites (e.g. malaria), viruses (e.g. HIV) and fungi (e.g.
Candida). Some antimicrobials are among the few alternatives available
for the treatment of serious bacterial infections in humans and
therefore occupy an important place in human medicine. Serious
infections are likely to result in significant morbidity or mortality
if left untreated. The seriousness of disease may relate to the site of
infection (e.g. pneumonia, meningitis) or to the host (e.g. an infant
or an immunosuppressed patient), and multidrug resistance can worsen
the outcome. The use of such antibacterial agents should be preserved,
because loss of efficacy in these drugs due to the emergence of
resistance has a significant impact on human health, especially for
people with life-threatening infections.
Antimicrobial agents used to treat diseases caused by bacteria that
may be transmitted to humans from non-human sources (i.e. water, food,
the environment or animals) are considered highly important, because
such infections are the most amenable to risk management. The
non-human sources and the bacteria that cause human disease are
linked. Examples include non-typhoidal Salmonella, Campylobacter spp.
and E. coli. Commensal organisms from non-human sources can also be
involved. Commensals may themselves be pathogenic in immunosuppressed
hosts, and the transfer of their resistance genes contributes to the
transmission of AMR. Antimicrobial classes are categorized as follows:
Critically important: Antimicrobial classes which meet both C1
and C2 are termed critically important for human medicine.
Highly important: Antimicrobial classes which meet either C1 or
C2 are termed highly important for human medicine.
Important: Antimicrobial classes used in humans which meet
neither C1 nor C2 are termed important for human medicine.
The list below gives examples of members of each class of drugs.
Antimicrobials such as aminoglycosides, ansamycins, carbapenems and
other penems, cephalosporins, glycopeptides, glycylcyclines,
lipopeptides, macrolides and ketolides, monobactams, oxazolidinones,
penicillins, phosphonic acid derivatives, polymyxins, quinolones,
sulfones, tetracyclines and nitrofurantoins are classified according
to their mode of action and the three categories explained above. The
details of these antimicrobials are given in Table 2, which also
describes their significance in treating disease and the respective
causative organisms.
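The three-way categorization stated above (both characteristics, either one, or neither) can be written as a small decision rule. The following is a minimal sketch; the function name `who_category` and its boolean arguments are hypothetical and are introduced here only for illustration.

```python
def who_category(meets_c1: bool, meets_c2: bool) -> str:
    """Map the two characteristics (C1, C2) to the category named in the text."""
    if meets_c1 and meets_c2:
        # Both characteristics are met.
        return "critically important"
    if meets_c1 or meets_c2:
        # Exactly one of the two characteristics is met.
        return "highly important"
    # Neither characteristic is met.
    return "important"


if __name__ == "__main__":
    for c1 in (True, False):
        for c2 in (True, False):
            print(f"C1={c1}, C2={c2}: {who_category(c1, c2)}")
```

Deciding whether a given class actually meets C1 or C2 remains an expert judgement; the sketch only encodes the final combination step.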
Materials & methods
Machine learning (ML) techniques are employed for the classification
of antimicrobials because they are well suited to data analysis and
model building. ML is a branch of artificial intelligence3–9 that
enables a system to learn from data, identify patterns and make
decisions with minimal human intervention. Because the data are large
and varied, computational processing is needed to make sense of them
for further use. ML techniques are inexpensive and powerful tools for
this purpose. In this paper, the author develops a model to classify
critically important antimicrobials for human medicine using support
vector machines (SVM). An SVM is a discriminative classifier: two sets
of objects are separated by a hyperplane. Given labelled training data
(supervised learning), the algorithm outputs an optimal hyperplane
which categorizes new examples. In a two-dimensional space, this
hyperplane is a line dividing the plane into two parts, with each
class lying on one side.10–16
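For reference, the separating hyperplane described above is commonly written as follows. This is the standard textbook (hard-margin) formulation rather than a formula taken from this paper, and the symbols w, b, x_i and y_i are introduced here for illustration: x_i is a training vector, y_i in {+1, -1} is its label, and w and b define the hyperplane.

\[
f(x) = \operatorname{sign}\left(w^{\top} x + b\right)
\]

\[
\min_{w,\,b} \; \tfrac{1}{2}\lVert w \rVert^{2}
\quad \text{subject to} \quad
y_i \left(w^{\top} x_i + b\right) \ge 1 \;\; \text{for all } i
\]

Minimizing \(\lVert w \rVert\) maximizes the margin between the two classes, which is what makes the resulting hyperplane "optimal" in the sense used above.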
Data
This section describes the preparation of the training and testing
datasets. The amino acid composition of all protein sequences was
obtained from PROCOS (Protein composition server).17 Although the
process is time consuming, it is accurate. Amino acid composition has
also been used to predict the subcellular localization of proteins, as
described in 4, and other related work has been motivated by the
importance of amino acids. The amino acid composition is the fraction
of each amino acid type within a protein:
\[
\text{Amino acid composition}_i = \frac{\text{total number of amino acid } i}{\text{total number of amino acids in the protein}} \tag{1}
\]
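Equation 1 can be computed directly from a sequence string. The sketch below is an illustration only and is not the PROCOS implementation; it assumes the 20 standard amino acids in one-letter code and ignores any other character. The names `AMINO_ACIDS` and `amino_acid_composition` are hypothetical.

```python
from collections import Counter

# Assumption: the 20 standard amino acids, one-letter code, alphabetical order.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence):
    """Return a 20-element list: the fraction of each amino acid (Equation 1)."""
    counts = Counter(sequence.upper())
    # Only standard residues contribute to the denominator.
    total = sum(counts[aa] for aa in AMINO_ACIDS)
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[aa] / total for aa in AMINO_ACIDS]


if __name__ == "__main__":
    composition = amino_acid_composition("AACD")
    for aa, fraction in zip(AMINO_ACIDS, composition):
        if fraction > 0:
            print(aa, fraction)
```

Each protein is thereby reduced to a fixed-length vector of 20 fractions that sum to 1, regardless of sequence length.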
After the protein sequences (peptides) were gathered, they were
divided into groups called datasets. There are three datasets,
according to the importance of the antimicrobials.18
Datasets
Dataset 1: Critically important antimicrobials: Protein data for the
microbes were taken from the UniProt database, and their amino acid
composition, computed by PROCOS, was used as input for the SVM. These
form the training set and are the positive samples to be classified.
For testing, negative samples from another enzymatic group were used.
Dataset 2: Highly important antimicrobials: Dataset 2 was prepared in
the same way as Dataset 1.
Dataset 3: Important antimicrobials: Dataset 3 was prepared in the
same way.
Negative examples: Because the SVM is a discriminative approach,
negative examples are required alongside the positive samples in
order to classify the positives accurately. When experimental methods
do not report an interaction between two proteins, the absence of a
positive signal does not by itself imply a true negative; that is, it
does not prove that there is no interaction. Real negative examples
are therefore an important part of obtaining good results.
Feature selection with SVD: Singular value decomposition (SVD) is a
method for reducing dimensionality and selecting the most relevant
and informative features. Principal component analysis (PCA)19,20 is
also used for feature selection and dimensionality reduction. Each
feature is a linear combination of attributes with a corresponding
eigenvalue (for PCA) or singular value (for SVD); the higher this
value, the more important the feature. Singular values are therefore
a good basis for choosing features. SVD was used in this work because
of its lower computational cost. In SVD, the rows corresponding to
proteins contribute to the combination coefficients, whereas in PCA
the covariance between attributes is calculated over all training
proteins together. Let A={MO;ST} be the training dataset containing
positive and negative examples. A matrix of size d*l is generated,
where d=p+n is the number of training vectors, p is the number of
positive examples, n is the number of negative examples, and l is the
length of each vector.
After the amino acid composition of each dataset was extracted, the
results were fed as input to the support vector machine, together
with feature selection and outlier detection. The aim is to find the
hyperplane that clearly separates each dataset from its negatives.
For each run of the SVM, a classifier is developed and its
performance is measured.
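To feed the d*l matrix of positive and negative vectors to LIBSVM, each row has to be written in LIBSVM's sparse text format, `<label> <index>:<value> ...`, with feature indices in ascending order. The sketch below shows one way to do this under the assumption that positives are labelled +1 and negatives -1 and that indices start at 1; the function names are hypothetical and the paper does not state how its input files were produced.

```python
def to_libsvm_line(label, features):
    """Format one feature vector as a LIBSVM line, skipping zero entries."""
    parts = [f"{label:+d}"]
    for index, value in enumerate(features, start=1):
        if value != 0.0:
            parts.append(f"{index}:{value:.6g}")
    return " ".join(parts)


def build_training_lines(positives, negatives):
    """Stack p positive and n negative vectors into d = p + n labelled lines."""
    lines = [to_libsvm_line(+1, vector) for vector in positives]
    lines += [to_libsvm_line(-1, vector) for vector in negatives]
    return lines


if __name__ == "__main__":
    positives = [[0.5, 0.0, 0.25, 0.25]]
    negatives = [[0.0, 1.0, 0.0, 0.0]]
    for line in build_training_lines(positives, negatives):
        print(line)
```

Omitting zero-valued features keeps the file small; LIBSVM treats any index that is absent from a line as zero.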
Performance evaluation: The performance of the classifier was
assessed by 10-fold cross-validation. LIBSVM provides a parameter
selection tool for the RBF kernel that combines cross-validation with
grid search. For each of Dataset 1, Dataset 2 and Dataset 3, a grid
search was performed over c and gamma. In each run, 10% of the samples
were held out as the test set and the remaining samples were used for
training. SVMs can suffer from over-fitting, in which the model fits
the training set too closely and generalizes poorly, but this can be
handled efficiently with cross-validation. The training and test sets
are identified for each run, so that each run uses a different 10% of
the samples as the test set. The test cases are classified by each
candidate rule set, and the one with the best predictive ability is
taken as the best model. Where the data are over-fitted, pruning is
applied.21
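The 10-fold procedure described above, in which each run holds out a different 10% of the samples for testing, can be sketched as follows. This is an illustrative splitter and not LIBSVM's internal code; the function name `k_fold_indices` and the fixed shuffle seed are assumptions made for reproducibility.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of (nearly) equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test


if __name__ == "__main__":
    for run, (train, test) in enumerate(k_fold_indices(100, k=10), start=1):
        print(f"run {run}: {len(train)} training samples, {len(test)} test samples")
```

Every sample appears in exactly one test fold, so the cross-validation accuracy is computed over the whole dataset without any sample being tested by a model that was trained on it.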
Results & discussion
A machine learning algorithm for the classification of antimicrobials
for human medicine is implemented in this paper. All three datasets
were run in LIBSVM, and the best result for each was retained as a model.
Model development
Model development is the final step, once the data have been
classified as required. After the testing data have been labelled and
several classifiers generated, the classifier that best fits the
classification is chosen and developed into a model for future use.
Figure 2 shows the model for critically important antimicrobials.
Figure 2 Model for critically important antimicrobials.
During model development with the SVM, c, g and accuracy are
calculated together. They are reported in Table 3, and the required
details are described later in this paper.
Table 3 Support vector machine results
Dataset     C     G          Accuracy (%)
Dataset 1   120   0.007813   99.8012
Dataset 2   120   0.0025     99.5
Dataset 3   120   0.0078     98.5
Figure 2 and Table 3 show that the datasets are classified with high
accuracy. The critically important antimicrobials (CIA), the focus of
this work, were classified with 99.8012% accuracy. The result also
holds for similar sequences, which suggests that amino acid
composition is well suited to classifying such sequences.
A detailed description follows:
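Accuracy is calculated as:
\[
\text{Accuracy} = \frac{tp + tn}{tp + tn + fp + fn}
\]
where tp = the number of true positives (positive samples classified as positive),
tn = the number of true negatives (negative samples classified as negative),
fp = the number of false positives (negative samples classified as positive),
fn = the number of false negatives (positive samples classified as negative).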
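The accuracy formula above, together with precision and recall, can be computed from true and predicted labels as in the following sketch. It is not LIBSVM code; the function names and the +1/-1 label convention are assumptions made for illustration.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count tp, tn, fp and fn for a binary classification."""
    tp = tn = fp = fn = 0
    for truth, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    """Fraction of all samples that were classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


def precision(tp, fp):
    """Fraction of predicted positives that are truly positive."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """Fraction of true positives that were recovered."""
    return tp / (tp + fn) if (tp + fn) else 0.0


if __name__ == "__main__":
    y_true = [1, 1, 1, -1, -1, -1, -1, 1]
    y_pred = [1, 1, -1, -1, -1, 1, -1, 1]
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    print("accuracy:", accuracy(tp, tn, fp, fn))
    print("precision:", precision(tp, fp))
    print("recall:", recall(tp, fn))
```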
Accuracy is reported directly by LIBSVM; precision and recall can be derived from the same predictions.
Given suitable values of c and g, the software calculates all
parameters and returns the result within minutes, depending on the
volume of data. The results clearly separate the antimicrobials into
three groups, so any newly discovered antibiotic can be assigned to
one of the categories defined above. The best values of c and g and
the corresponding accuracy were identified for all three datasets.
c and g are the two parameters of the RBF kernel, and the best values
cannot be known in advance. LIBSVM's parameter selection tool searches
for the (c, g) pair that gives the highest cross-validation accuracy,
and a well-chosen pair lets the classifier predict better. The
prediction accuracy indicates the performance on classifying an
independent dataset, so it shows how well the classifier can be
expected to perform on unknown data. Cross-validation is used to
estimate it. In n-fold cross-validation, the training set is first
divided into n subsets of equal size. Each subset in turn is tested
using the classifier trained on the remaining (n-1) subsets. The
cross-validation accuracy is therefore the percentage of data that is
correctly classified, and the procedure guards against over-fitting.
The grid search approach is used because (a) unlike methods that
avoid an exhaustive parameter search through approximations or
heuristics, it is straightforward and dependable, (b) the
computational time is modest because there are only two parameters,
and (c) each (c, g) pair is independent, so the search can easily be
parallelized. SVM is therefore a well-suited computational method for
biological data classification, with a manageable cross-validation cost.
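The grid search over c and g can be sketched as below. This is not LIBSVM's parameter selection tool: `cv_accuracy` is a hypothetical callback that is assumed to train an RBF-kernel SVM with the given (c, g) and return its cross-validation accuracy, and the default exponent ranges (powers of two) are an assumption modelled on commonly used LIBSVM settings, not values reported in this paper.

```python
import itertools


def grid_search(cv_accuracy, log2_c=range(-5, 16, 2), log2_g=range(-15, 4, 2)):
    """Try every (c, g) pair on the grid and return (best_c, best_g, best_accuracy)."""
    best = None
    for exp_c, exp_g in itertools.product(log2_c, log2_g):
        c, g = 2.0 ** exp_c, 2.0 ** exp_g
        acc = cv_accuracy(c, g)
        # Keep the first pair that reaches the highest accuracy seen so far.
        if best is None or acc > best[2]:
            best = (c, g, acc)
    return best


if __name__ == "__main__":
    # Toy stand-in for a real cross-validation run, for demonstration only.
    def toy_cv_accuracy(c, g):
        return 100.0 if (c == 8.0 and g == 2.0 ** -7) else 50.0

    print(grid_search(toy_cv_accuracy))
```

Because every (c, g) evaluation is independent of the others, the loop body could be distributed across processes without changing the result.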
Conclusion
Machine learning is an active area of research that requires experts
who can handle data safely and treat the data as an information
retrieval system. In this paper, a machine learning model was
developed for antimicrobials used in human medicine. The WHO has
taken the initiative in recommending which antimicrobials are
critically important for human medicine, and the importance of these
medicines needs to be communicated publicly. This paper therefore
classifies critically important antimicrobials for human medicine
with high accuracy. Future treatment should take the effects of
antimicrobials into account, and any newly identified microbe or
antimicrobial should be assigned to a category based on its amino
acid composition using the model developed here.
DR Anubha Dubey
Maulana Azad National Institute of Technology, Bhopal | MANIT · Department of Bioinformatics
MSc biotechnology, PhD bioinformatics,
phone No,9993210963