Meta-Learning for Hierarchical Classification and Applications in Bioinformatics
Hierarchical classification is a special type of classification task where the class labels are organised into a hierarchy, with more generic class labels being ancestors of more specific ones. Meta-learning for classification-algorithm recommendation consists of recommending to the user a classification algorithm, from a pool of candidate algorithms, for a dataset, based on the past performance of the candidate algorithms on other datasets. Meta-learning is normally used in conventional, non-hierarchical classification. By contrast, this paper proposes a meta-learning approach for the more challenging task of hierarchical classification, and evaluates it on a large number of bioinformatics datasets. Hierarchical classification is especially relevant for bioinformatics problems, as protein and gene functions tend to be organised into a hierarchy of class labels. This work proposes a meta-learning approach for
recommending the best hierarchical classification algorithm for a
hierarchical classification dataset. This work’s contributions are: 1)
proposing an algorithm for splitting hierarchical datasets into
new datasets to increase the number of meta-instances, 2) proposing
meta-features for hierarchical classification, and 3) interpreting
decision-tree meta-models for hierarchical classification algorithm
ELISA Based hTSH Assessment Using Two Sensitive and Specific Anti-hTSH Polyclonal Antibodies
Producing specific antibody responses against hTSH is cumbersome because of the high identity between hTSH and the other members of the glycoprotein hormone family (FSH, LH and HCG), and between hTSH and that of the host animals used for antibody production. Therefore, two polyclonal antibodies were purified against two recombinant proteins. Four possible ELISA tests were designed based on these antibodies. These ELISA tests were checked against hTSH and other glycoprotein hormones, and their sensitivity and specificity were assessed. Bioinformatics tools were used to analyze the immunological properties. Following immunogen region selection from the hTSH protein, the C-terminal of B hTSH was selected and applied. Two recombinant genes containing these fragments (first: two repeats of the C-terminal of B hTSH; second: tetanus toxin + B hTSH C-terminal) were designed and sub-cloned into the pET32a expression vector. Standard methods were used for protein expression, purification, and verification. Thereafter, New Zealand White rabbits were immunized and their sera were used for antibody titration, purification and characterization. Then, four ELISA tests based on the two antibodies were employed to assess hTSH and other glycoprotein hormones. The results of these assessments were compared with standard amounts. The results indicated that the desired antigens were successfully designed, sub-cloned, expressed, confirmed and used for in vivo immunization. The raised antibodies were capable of specific and sensitive hTSH detection, while cross-reactivity with the other members of the glycoprotein hormone family was minimal. Among the four designed tests, the test in which the antibody against the first protein was used as the capture antibody and the antibody against the second protein was used as the detector antibody did not show any hook effect up to 50 mIU/L.
Both proteins are able to induce highly sensitive and specific antibody responses against hTSH. One combination of these antibodies has the highest sensitivity and specificity in hTSH detection.
Identification of Disease Causing DNA Motifs in Human DNA Using Clustering Approach
Studying DNA (deoxyribonucleic acid) sequences is useful for understanding biological processes and is applied in fields such as diagnostic and forensic research. DNA carries the hereditary information in humans and almost all other organisms, and it is passed on to their offspring. Early-stage detection of defective DNA sequences may lead to many developments in the field of bioinformatics. Currently, various tedious techniques are used to identify defective DNA. The proposed work analyzes and identifies cancer-causing DNA motifs in a given sequence. Initially, the human DNA sequence is split into k-mers using a k-mer separation rule. The resulting k-mers are clustered using a Self-Organizing Map (SOM). Using the Levenshtein distance measure, the cancer-associated DNA motif is identified from the k-mer clusters. The experimental results of this work indicate the presence or absence of a cancer-causing DNA motif. If the cancer-associated DNA motif is found in the DNA, the sequence is declared a cancer-causing DNA sequence. Otherwise, the input human DNA is declared a normal sequence. Finally, the elapsed time for finding the cancer-causing DNA motif using cluster formation is measured and compared with that of the normal process of finding the motif. Locating the cancer-associated motif is easier with cluster formation than without it. The proposed work is intended as an initial aid for research on genetic diseases.
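The k-mer separation and Levenshtein matching steps described in this abstract can be sketched as follows. This is a minimal illustration only: the SOM clustering step is omitted, and the function names and the `max_distance` threshold are assumptions, not taken from the paper.

```python
def kmers(sequence, k):
    """Return every overlapping substring of length k, in order."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def levenshtein(a, b):
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def contains_motif(sequence, motif, max_distance=0):
    """True if any k-mer lies within max_distance edits of the motif.

    max_distance is a hypothetical tolerance; 0 means exact match only.
    """
    return any(levenshtein(kmer, motif) <= max_distance
               for kmer in kmers(sequence, len(motif)))
```

For example, `contains_motif("TTACGTTT", "ACGA", max_distance=1)` scans the five 4-mers of the sequence and accepts `ACGT`, which is one substitution away from the motif.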
Intellectual Property Protection of CRISPR Related Technologies
CRISPR research has the potential to completely transform life science, agriculture, livestock and the health care industry. The intellectual property (IP) derived from this research has attracted significant attention in academia as well as in the biopharmaceutical industry, culminating in an urgent need for strategic IP protection. We review the rudimentary concepts and key competitors of CRISPR technologies as well as the paramount strategies for intellectual property protection. Further, we elaborate on prosecution issues related to CRISPR patents as well as possible solutions to various patent laws, interferences and litigation. Finally, we address how the bioinformatics of CRISPR technology raises questions of privacy and a host of ethical concerns.
Imputation Technique for Feature Selection in Microarray Data Set
Analyzing DNA microarray data sets is a great
challenge for bioinformaticians because of the complexity
of applying statistical and machine learning techniques. The challenge
is doubled if the microarray data sets contain missing data,
which happens regularly, because these techniques cannot deal with
missing data. One of the most important data analysis processes on
microarray data sets is feature selection. This process finds the
most important genes that affect a certain disease. In this paper, we
introduce a technique for imputing the missing data in microarray
data sets while performing feature selection.
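The abstract does not say which imputation or selection method is used. Purely as a generic baseline, and not the paper's technique, the sketch below fills missing expression values with per-gene means and then ranks genes by variance. The data layout (rows are samples, columns are genes, `None` marks a missing value) and both function names are assumptions.

```python
def impute_gene_means(matrix):
    """Replace each None with the mean of the observed values for that gene."""
    n_genes = len(matrix[0])
    means = []
    for g in range(n_genes):
        observed = [row[g] for row in matrix if row[g] is not None]
        # A gene with no observed values falls back to 0.0 (arbitrary choice).
        means.append(sum(observed) / len(observed) if observed else 0.0)
    return [[means[g] if row[g] is None else row[g] for g in range(n_genes)]
            for row in matrix]


def rank_genes_by_variance(matrix):
    """Return gene indices sorted from highest to lowest variance."""
    n_samples = len(matrix)
    variances = []
    for g in range(len(matrix[0])):
        column = [row[g] for row in matrix]
        mean = sum(column) / n_samples
        variances.append(sum((v - mean) ** 2 for v in column) / n_samples)
    return sorted(range(len(variances)), key=lambda g: variances[g], reverse=True)


# Toy expression matrix: 3 samples x 3 genes, with missing entries.
data = [[1.0, None, 5.0],
        [3.0, 2.0, None],
        [None, 4.0, 5.0]]
completed = impute_gene_means(data)
ranking = rank_genes_by_variance(completed)
```

Note that this baseline runs imputation and selection as two separate passes, whereas the abstract describes imputing while performing feature selection.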
Polymorphic Marker Designed from Bioinformatics Sequences Related to Cell Wall Strength for Discrimination of Mangosteen (Garcinia mangostana L.) Clones Resistant to Gamboge Disorder
Gamboge disorder (GD), or fruit damage by yellow sap, is a major problem in mangosteen. Mangosteen plants vary in their level of GD, from very low or non-GD to low, moderate and high GD. However, it is difficult to differentiate between GD and non-GD plants because evaluation of the disorder is strongly influenced by the environment. In this study, we investigated the usefulness of a primer designed from bioinformatics sequences related to cell wall strength, termed MCWS, for predicting GD. The plant materials were 28 mangosteen plants selected on the basis of their percentage of GD, categorized as high, moderate, low and very low or non-GD. The results showed that the specific DNA fragments were absent in the high-GD accessions. The MCWS marker is suggested as a novel polymorphic marker for GD in mangosteen, as well as a marker for detecting variability in mangosteen as an apomictic plant.
Biological Data Integration using SOA
Nowadays scientific data is inevitably digital and
stored in a wide variety of formats in heterogeneous systems.
Scientists need to access an integrated view of remote or local
heterogeneous data sources with advanced data accessing, analyzing,
and visualization tools. This research suggests the use of Service
Oriented Architecture (SOA) to integrate biological data from
different data sources. This work shows that SOA can solve the problems
facing the integration process and considers whether biologists can
access biological data more easily. There are several methods to
implement SOA, but web services are the most popular. The
Microsoft .NET Framework was used to implement the proposed architecture.
A Hybridization of Constructive Beam Search with Local Search for Far From Most Strings Problem
The Far From Most Strings Problem (FFMSP) is to obtain a string which is far from as many as possible of a given set of strings. All the input and output strings are of the same length, and two strings are said to be far if their Hamming distance is greater than or equal to a given positive integer. FFMSP belongs to the class of sequence consensus problems, which have applications in molecular biology. The problem is NP-hard; it does not admit a constant-ratio approximation either, unless P = NP. Therefore, in addition to exact and approximate algorithms, (meta)heuristic algorithms have been proposed for the problem in recent years. Hybrid algorithms have also been proposed in recent years and successfully used for many hard problems in a variety of domains. In this paper, a new metaheuristic algorithm, called Constructive Beam and Local Search (CBLS), is investigated for the problem; it is a hybridization of constructive beam search and local search algorithms. More specifically, the proposed algorithm consists of two phases: the first phase obtains several candidate solutions via constructive beam search, and the second phase applies local search to the candidate solutions obtained by the first phase. The best solution found is returned as the final solution to the problem. The proposed algorithm is also similar to memetic algorithms in the sense that both use local search to further improve individual solutions. The CBLS algorithm is compared with the most recent published algorithm for the problem, GRASP, with significantly positive results; the improvement is by orders of magnitude in most cases.
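The FFMSP objective defined in this abstract, together with a simple improvement step, can be sketched as follows. This is not the CBLS algorithm itself: the constructive beam search phase is omitted, and the single-character first-improvement neighbourhood used in `local_search` is an assumption chosen for brevity.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("strings must have the same length")
    return sum(x != y for x, y in zip(a, b))


def far_count(candidate, strings, threshold):
    """FFMSP objective: how many input strings are far from the candidate."""
    return sum(hamming(candidate, s) >= threshold for s in strings)


def local_search(candidate, strings, threshold, alphabet):
    """Change one character at a time, keeping only strict improvements."""
    best = list(candidate)
    best_score = far_count(best, strings, threshold)
    improved = True
    while improved:
        improved = False
        for i in range(len(best)):
            kept = best[i]
            for ch in alphabet:
                if ch == kept:
                    continue
                best[i] = ch
                score = far_count(best, strings, threshold)
                if score > best_score:
                    best_score, kept, improved = score, ch, True
                else:
                    best[i] = kept  # revert a non-improving change
            best[i] = kept
    return "".join(best), best_score


strings = ["AAAA", "AAAT", "TTTT"]
solution, score = local_search("CCCA", strings, threshold=4, alphabet="ACGT")
```

Starting from `CCCA`, which is far from two of the three inputs at threshold 4, the search changes the last character and returns `CCCC`, which is far from all three.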
A Novel Approach for Protein Classification Using Fourier Transform
Discovering new biological knowledge from high-throughput biological data is a major challenge for bioinformatics today. To address this challenge, we developed a new approach for protein classification. Proteins that are evolutionarily, and thereby functionally, related are said to belong to the same classification. Identifying protein classification is of fundamental importance for documenting the diversity of the known protein universe. It also provides a means to determine the functional roles of newly discovered protein sequences. Our goal is to predict the functional classification of novel protein sequences based on a set of features extracted from each protein sequence. The proposed technique uses datasets extracted from the Structural Classification of Proteins (SCOP) database. A set of spectral domain features based on the Fast Fourier Transform (FFT) is used. The proposed classifier uses a multilayer back propagation (MLBP) neural network for protein classification. The maximum classification accuracy is about 91% when applying the classifier to the full four levels of the SCOP database. However, it reaches a maximum of 96% when limiting the classification to the family level. The classification results reveal that the spectral domain contains information that can be used for classification with high accuracy. In addition, the results emphasize that sequence similarity measures are of great importance, especially at the family level.
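To illustrate what spectral domain features of a protein sequence can look like, the sketch below maps residues to numbers and takes discrete Fourier transform magnitudes. The abstract does not state the numeric encoding or the number of features, so the `encoding` argument, `n_features`, and the mean-centering step are all assumptions. A naive DFT is used instead of an FFT library to keep the example dependency-free; the magnitudes are the same.

```python
import cmath


def dft_magnitudes(signal):
    """Magnitude of each discrete Fourier coefficient of a real signal."""
    n = len(signal)
    return [abs(sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t, x in enumerate(signal)))
            for k in range(n)]


def spectral_features(sequence, encoding, n_features):
    """Fixed-length vector of low-to-high frequency magnitudes.

    encoding: dict mapping each residue letter to a number (hypothetical;
    any physico-chemical scale could be substituted).
    """
    signal = [encoding[residue] for residue in sequence]
    mean = sum(signal) / len(signal)
    centered = [x - mean for x in signal]          # drop the constant offset
    magnitudes = dft_magnitudes(centered)
    half = magnitudes[1:len(magnitudes) // 2 + 1]  # real signal: spectrum is symmetric
    return (half + [0.0] * n_features)[:n_features]  # pad or truncate


# Toy two-letter alphabet, only to show the call shape.
toy_encoding = {"A": 1.0, "B": -1.0}
features = spectral_features("ABABABAB", toy_encoding, n_features=4)
```

For the strictly alternating toy sequence, all the energy lands in the highest-frequency bin, so the last feature is the only non-zero one. A vector like this would then be fed to a classifier such as the neural network the abstract mentions.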
Multi-Agent Systems Applied in the Modeling and Simulation of Biological Problems: A Case Study in Protein Folding
The multi-agent system approach has proven to be an effective and appropriate abstraction level for constructing whole models of a diversity of biological problems, integrating aspects found in both "micro" and "macro" approaches to modeling this type of phenomenon. Taking these considerations into account, this paper presents the important computational characteristics to be gathered into a novel bioinformatics framework built upon a multi-agent architecture. The version of the tool presented herein allows studying and exploring complex problems belonging principally to structural biology, such as protein folding. The bioinformatics framework is used as a virtual laboratory to explore a minimalist model of protein folding as a test case. In order to show the laboratory concept of the platform as well as its flexibility and adaptability, we studied the folding of two particular sequences, one a 45-mer and the other a 64-mer, both described by an HP model (only hydrophobic and polar residues) on a coarse-grained 2D square lattice. As explained in the discussion section of this work, these two sequences were chosen as stress tests for the platform, in order to determine which tools need to be created or improved to meet the computation and analysis needs of a given tough sequence. The underlying philosophy is that the continuous study of sequences itself reveals important points to be added to the platform to continually improve its efficiency, as is demonstrated herein.
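The HP model on a 2D square lattice mentioned in this abstract can be made concrete with a small energy function. The sketch assumes the standard HP convention of -1 per contact between non-consecutive hydrophobic residues on adjacent lattice sites. The move encoding (`U`, `D`, `L`, `R`) and function names are hypothetical and unrelated to the platform described.

```python
MOVES = {"U": (0, 1), "D": (0, -1), "L": (-1, 0), "R": (1, 0)}


def fold_coordinates(moves):
    """Lattice coordinates of each residue, starting at the origin."""
    x, y = 0, 0
    coords = [(x, y)]
    for move in moves:
        dx, dy = MOVES[move]
        x, y = x + dx, y + dy
        coords.append((x, y))
    return coords


def hp_energy(sequence, moves):
    """Energy of a fold: minus the number of H-H contacts.

    Returns None if the walk visits a lattice site twice (invalid fold).
    """
    coords = fold_coordinates(moves)
    if len(coords) != len(sequence):
        raise ValueError("need exactly len(sequence) - 1 moves")
    if len(set(coords)) != len(coords):
        return None
    index_at = {site: i for i, site in enumerate(coords)}
    contacts = 0
    for i, (x, y) in enumerate(coords):
        if sequence[i] != "H":
            continue
        # Look only right and up so that each neighbouring pair is counted once.
        for dx, dy in ((1, 0), (0, 1)):
            j = index_at.get((x + dx, y + dy))
            if j is not None and abs(i - j) > 1 and sequence[j] == "H":
                contacts += 1
    return -contacts
```

For instance, `hp_energy("HPPH", "RUL")` folds the chain into a unit square, bringing the two H residues at the chain ends into contact, so the energy is -1. A search procedure would minimise this value over valid move strings.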
Endometrial Cancer Recognition via EEG Dependent upon 14-3-3 Protein Leading to an Ontological Diagnosis
The purpose of my research proposal is to
demonstrate that there is a relationship between EEG and cancer.
The above relationship is based on an Aristotelian syllogism:
since it is known that the 14-3-3 protein is related to the electrical
activity of the brain via control of the flow of Na+ and K+ ions and
since it is also known that many types of cancer are associated with
14-3-3 protein, it is possible that there is a relationship between EEG
and cancer. This research will be carried out by well-defined
diagnostic indicators, obtained via the EEG, using signal processing
procedures and pattern recognition tools such as neural networks in
order to recognize the endometrial cancer type. The current research
shall compare the findings from EEG and hysteroscopy performed on
women of a wide age range. Moreover, this practice could be
expanded to other types of cancer. The implementation of this
methodology will be completed with the creation of an ontology.
This ontology shall define the concepts existing in this research's
domain and the relationships between them. It will represent the
types of relationships between hysteroscopy and EEG findings.
Bioinformatics Profiling of Missense Mutations
Distinguishing missense nucleotide
substitutions that contribute to harmful effects from those that do not
is a difficult problem, usually addressed through functional in
vivo analyses. In this study, instead of current biochemical methods, the
effects of missense mutations upon protein structure and function
were assayed by means of computational methods and information
from databases. To this end, the effects of new missense
mutations in exon 5 of the PTEN gene upon protein structure and
function were examined. The gene coding for PTEN was identified
as a tumor suppressor gene and localized on chromosome region
10q23.3. The use of these methods showed that the
c.319G>A and c.341T>G missense mutations, which were recognized in
patients with breast cancer and Cowden disease, could be pathogenic.
This method could be used for analysis of missense mutations in others
Exploring Dimensionality, Systematic Mutations and Number of Contacts in Simple HP ab-initio Protein Folding Using a Blackboard-based Agent Platform
A computational platform is presented in this
contribution. It has been designed as a virtual laboratory to be used
for exploring optimization algorithms in biological problems. This
platform is built on a blackboard-based agent architecture. As a test
case, the version of the platform presented here is devoted to the
study of protein folding, initially with a bead-like description of the
chain and with the widely used model of hydrophobic and polar
residues (HP model). Some details of the platform design are
presented along with its capabilities, and some
explorations of protein folding problems with different types of
discrete space are also reviewed. The capability of the platform to
incorporate specific tools for the structural analysis of the runs, in
order to understand and improve the optimization process, is also shown.
Accordingly, the results obtained demonstrate that assembling
computational tools into a single platform is worthwhile in itself,
since experiments developed on it can be designed to provide different
levels of information in a self-consistent fashion. Currently, we are
exploring how an experiment design can be used to create a
computational agent to be included within the platform. These
inclusions of designed agents (or software pieces) help the
platform better accomplish its tasks.
Clearly, as the number of agents increases, the new version of the
virtual laboratory gains robustness and functionality.
A Bayesian Kernel for the Prediction of Protein-Protein Interactions
Understanding protein function is a major goal in
the post-genomic era. Proteins usually work in the context of other
proteins and rarely function alone. Therefore, it is highly relevant to
study the interaction partners of a protein in order to understand its
function. Machine learning techniques have been widely applied to
predict protein-protein interactions. Kernel functions play an
important role in a successful machine learning technique. Choosing
the appropriate kernel function can lead to better accuracy in a
binary classifier such as a support vector machine. In this paper,
we describe a Bayesian kernel for the support vector machine to
predict protein-protein interactions. The use of a Bayesian kernel can
improve classifier performance by incorporating the probability
characteristics of the available experimental protein-protein
interaction data compiled from different sources. In
addition, the probabilistic output from the Bayesian kernel can assist
biologists in conducting more research on the highly predicted
interactions. The results show that the accuracy of the classifier is
improved using the Bayesian kernel compared to the standard
SVM kernels. These results imply that protein-protein interactions can
be predicted using a Bayesian kernel with better accuracy than with
the standard SVM kernels.
Grid Computing in Physics and Life Sciences
Certain sciences, such as physics, chemistry or biology,
have a strong computational aspect and use computing infrastructures
to advance their scientific goals. Often, high performance and/or high
throughput computing infrastructures such as clusters and computational
Grids are applied to satisfy computational needs. In addition,
these sciences are sometimes characterised by scientific collaborations
requiring resource sharing which is typically provided by Grid
approaches. In this article, I discuss Grid computing approaches in
High Energy Physics as well as in bioinformatics and highlight some
of my experience in both scientific domains.
Mining Genes Relations in Microarray Data Combined with Ontology in Colon Cancer Automated Diagnosis System
The MATCH project entails the development of an
automatic diagnosis system that aims to support the treatment of colon
cancer by discovering mutations that occur in tumour
suppressor genes (TSGs) and contribute to the development of
cancerous tumours. The system is based on a)
colon cancer clinical data and b) biological information that will be
derived by data mining techniques from genomic and proteomic
sources. The core mining module will consist of popular, well
tested hybrid feature extraction methods and new combined
algorithms designed especially for the project. Elements of rough
sets, evolutionary computing, cluster analysis, self-organizing maps
and association rules will be used to discover the annotations
between genes and their influence on tumours.
The methods used to process the data have to address its high
complexity, potential inconsistency and the problem of dealing with
missing values. They must integrate all the useful information
necessary to answer the expert's question. For this purpose, the system
has to learn from data, or allow a domain specialist to specify
interactively, the part of the knowledge structure it needs to answer a
given query. The program should also take into account the
importance/rank of the particular parts of the data it analyses and adjust
the algorithms used accordingly.
Addressing Scalability Issues of Named Entity Recognition Using Multi-Class Support Vector Machines
This paper explores the scalability issues associated
with solving the Named Entity Recognition (NER) problem using
Support Vector Machines (SVM) and high-dimensional features. The
performance results of a set of experiments conducted using binary
and multi-class SVM with increasing training data sizes are
examined. The NER domain chosen for these experiments is the
biomedical publications domain, especially selected due to its
importance and inherent challenges. A simple machine learning
approach is used that eliminates prior language knowledge such as
part-of-speech or noun phrase tagging thereby allowing for its
applicability across languages. No domain-specific knowledge is
included. The accuracy measures achieved are comparable to those
obtained using more complex approaches, which constitutes a
motivation to investigate ways to improve the scalability of multiclass
SVM in order to make the solution more practical and useable.
Improving training time of multi-class SVM would make support
vector machines a more viable and practical machine learning
solution for real-world problems with large datasets. An initial
prototype results in great improvement of the training time at the
expense of memory requirements.
Comparison of Domain and Hydrophobicity Features for the Prediction of Protein-Protein Interactions using Support Vector Machines
The protein domain structure has been widely used as the most informative sequence feature to computationally predict protein-protein interactions. However, in a recent study, a research group reported a very high accuracy of 94% using the hydrophobicity feature. Therefore, in this study we compare and verify the usefulness of protein domain structure and hydrophobicity properties as sequence features. Using Support Vector Machines (SVM) as the learning system, our results indicate that both features achieved an accuracy of nearly 80%. Furthermore, domain structure had a receiver operating characteristic (ROC) score of 0.8480 with a running time of 34 seconds, while hydrophobicity had a ROC score of 0.8159 with a running time of 20,571 seconds (5.7 hours). These results indicate that protein-protein interactions can be predicted from domain structure with reliable accuracy and acceptable running time.
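The ROC score reported in this abstract is commonly computed as the area under the ROC curve. A minimal sketch, assuming binary labels (1 for interacting, 0 for non-interacting) and real-valued classifier scores, uses the rank interpretation: the probability that a random positive scores higher than a random negative, with ties counted as one half. The function name and label convention are assumptions.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of scores."""
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy example: one of the four positive/negative pairs is ranked wrongly.
auc = roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])

# The running-time conversion quoted in the abstract.
hours = 20571 / 3600
```

The pairwise form is quadratic in the number of examples; it is used here for clarity rather than speed.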
A New Predictor of Coding Regions in Genomic Sequences using a Combination of Different Approaches
Identifying protein coding regions in DNA sequences is a basic step in the location of genes. Several approaches based on signal processing tools have been applied to solve this problem, trying to achieve more accurate predictions. This paper presents a new predictor that improves the efficacy of three techniques that use the Fourier Transform to predict coding regions, and that can be computed using an algorithm that reduces the computational load. Some ideas about combining the predictor with other methods are discussed. ROC curves are used to demonstrate the efficacy of the proposed predictor, based on computations on 25 DNA sequences from three different organisms.
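The abstract does not name the three Fourier-based techniques it builds on. A widely used baseline of this kind measures the period-3 spectral content of a DNA window, since coding regions tend to show a peak at frequency 1/3. The sketch below shows that baseline only, not the proposed predictor; the window length and step defaults are assumptions.

```python
import cmath


def period3_power(window):
    """Sum over A, C, G, T of |DFT at frequency 1/3|^2 of the base's indicator sequence."""
    total = 0.0
    for base in "ACGT":
        coefficient = sum(cmath.exp(-2j * cmath.pi * t / 3)
                          for t, ch in enumerate(window) if ch == base)
        total += abs(coefficient) ** 2
    return total


def sliding_period3(sequence, window=351, step=3):
    """Period-3 power of each window; high values suggest coding regions."""
    return [period3_power(sequence[start:start + window])
            for start in range(0, len(sequence) - window + 1, step)]


periodic = period3_power("ACG" * 10)   # strong codon-like periodicity
uniform = period3_power("A" * 30)      # no period-3 structure
profile = sliding_period3("ACG" * 10, window=9, step=3)
```

A perfectly periodic window gives a large value, while a homopolymer gives essentially zero; thresholding the sliding profile yields candidate coding regions whose quality can then be assessed with ROC curves, as the abstract describes.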
EEG Waves Classifier using Wavelet Transform and Fourier Transform
The electroencephalograph (EEG) signal is one of the most widely used signals in the bioinformatics field due to its rich information about human tasks. In this work, EEG wave classification is achieved using the Discrete Wavelet Transform (DWT) with the Fast Fourier Transform (FFT) on normalized EEG data. The DWT is used as a classifier of the EEG wave frequencies, while the FFT is used to visualize the EEG waves at the multiple resolutions of the DWT. Several real EEG data sets (for both normal and abnormal persons) have been tested, and the results support the validity of the proposed technique.
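The way a DWT separates a signal into frequency bands can be sketched with the Haar wavelet, the simplest choice. The abstract does not say which wavelet or how many levels are used, so both are assumptions here. Each level splits the current approximation into a lower-frequency half (approximation) and a higher-frequency half (detail).

```python
import math


def haar_dwt(signal):
    """One level of the orthonormal Haar transform: (approximation, detail)."""
    if len(signal) % 2:
        raise ValueError("signal length must be even")
    root2 = math.sqrt(2)
    approximation = [(signal[i] + signal[i + 1]) / root2
                     for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / root2
              for i in range(0, len(signal), 2)]
    return approximation, detail


def multilevel_haar(signal, levels):
    """Repeatedly decompose the approximation; details[0] is the finest band."""
    approximation = list(signal)
    details = []
    for _ in range(levels):
        approximation, detail = haar_dwt(approximation)
        details.append(detail)
    return approximation, details


approx, details = multilevel_haar([4.0, 4.0, 4.0, 4.0], levels=2)
```

With a sampling rate fs, the detail coefficients at level j (counting from 1) cover roughly fs/2^(j+1) to fs/2^j, which is how wavelet levels are usually matched to EEG frequency bands. A constant signal, as in the toy call, has no detail energy at any level.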
One-Class Support Vector Machines for Protein-Protein Interactions Prediction
Predicting protein-protein interactions represents a key step in understanding protein function. This is because proteins usually work in the context of other proteins and rarely function alone. Machine learning techniques have been applied to predict protein-protein interactions. However, most of these techniques address this problem as a binary classification problem. Although it is easy to get a dataset of interacting proteins as positive examples, there are no experimentally confirmed non-interacting proteins to be considered as negative examples. Therefore, in this paper we solve this problem as a one-class classification problem using one-class support vector machines (SVM). Using only positive examples (interacting protein pairs) in the training phase, the one-class SVM achieves an accuracy of about 80%. These results imply that protein-protein interactions can be predicted using a one-class classifier with accuracy comparable to that of binary classifiers that use artificially constructed negative examples.
Computational Method for Annotation of Protein Sequence According to Gene Ontology Terms
Annotation of a protein sequence is pivotal for the understanding of its function. The accuracy of manual annotation provided by curators is still questionable because of its weaker evidence strength, and manual annotation remains a hard and time-consuming task. A number of computational methods, including tools, have been developed to tackle this challenging task. However, they require high-cost hardware, are difficult for bioscientists to set up, or depend on time-intensive and blind sequence similarity searches such as the Basic Local Alignment Search Tool. This paper introduces a new method of assigning highly correlated Gene Ontology terms of annotated protein sequences to partially annotated or newly discovered protein sequences. This method is fully based on Gene Ontology data and annotations. Two problems were identified in realizing this method. The first problem relates to splitting the single monolithic Gene Ontology RDF/XML file into a set of smaller files that are easy to assess and process. These files can then be enriched with protein sequences and Inferred from Electronic Annotation evidence associations. The second problem involves searching for a set of Gene Ontology terms that are semantically similar to a given query. The macro and micro problems involved and their solutions, as well as the objective of this study, are described in detail. This paper also describes protein sequence annotation and the Gene Ontology. The methodology of this study and the Gene Ontology based protein sequence annotation tool, namely extended UTMGO, are presented. Furthermore, its basic version, a Gene Ontology browser based on semantic similarity search, is also introduced.