|Commenced in January 2007||Frequency: Monthly||Edition: International||Paper Count: 34|
Data mining technique used in the field of clustering is a subject of active research and assists in biological pattern recognition and extraction of new knowledge from raw data. Clustering means the act of partitioning an unlabeled dataset into groups of similar objects. Each group, called a cluster, consists of objects that are similar between themselves and dissimilar to objects of other groups. Several clustering methods are based on partitional clustering. This category attempts to directly decompose the dataset into a set of disjoint clusters leading to an integer number of clusters that optimizes a given criterion function. The criterion function may emphasize a local or a global structure of the data, and its optimization is an iterative relocation procedure. The K-Means algorithm is one of the most widely used partitional clustering techniques. Since K-Means is extremely sensitive to the initial choice of centers and a poor choice of centers may lead to a local optimum that is quite inferior to the global optimum, we propose a strategy to initiate K-Means centers. The improved K-Means algorithm is compared with the original K-Means, and the results prove how the efficiency has been significantly improved.
Endothelial Progenitor Cell (EPC) based therapies continue to be of interest to treat ischemic events based on their proven role to promote blood vessel formation and thus tissue re-vascularisation. Current strategies for the production of clinical-grade EPCs requires the in vitro isolation of EPCs from peripheral blood followed by cell expansion to provide sufficient quantities EPCs for cell therapy. This study aims to examine the use of different biomolecules to significantly improve the current strategy of EPC capture and expansion on collagen type I (Col I). In this study, four different biomolecules were immobilised on a surface and then investigated for their capacity to support EPC capture and proliferation. First, a cell microarray platform was fabricated by coating a glass surface with epoxy functional allyl glycidyl ether plasma polymer (AGEpp) to mediate biomolecule binding. The four candidate biomolecules tested were Col I, collagen type II (Col II), collagen type IV (Col IV) and vascular endothelial growth factor A (VEGF-A), which were arrayed on the epoxy-functionalised surface using a non-contact printer. The surrounding area between the printed biomolecules was passivated with polyethylene glycol-bisamine (A-PEG) to prevent non-specific cell attachment. EPCs were seeded onto the microarray platform and cell numbers quantified after 1 h (to determine capture) and 72 h (to determine proliferation). All of the extracellular matrix (ECM) biomolecules printed demonstrated an ability to capture EPCs within 1 h of cell seeding with Col II exhibiting the highest level of attachment when compared to the other biomolecules. Interestingly, Col IV exhibited the highest increase in EPC expansion after 72 h when compared to Col I, Col II and VEGF-A. These results provide information for significant improvement in the capture and expansion of human EPC for further application.
In recent years, there has been an explosion in the rate of using technology that help discovering the diseases. For example, DNA microarrays allow us for the first time to obtain a "global" view of the cell. It has great potential to provide accurate medical diagnosis, to help in finding the right treatment and cure for many diseases. Various classification algorithms can be applied on such micro-array datasets to devise methods that can predict the occurrence of Leukemia disease. In this study, we compared the classification accuracy and response time among eleven decision tree methods and six rule classifier methods using five performance criteria. The experiment results show that the performance of Random Tree is producing better result. Also it takes lowest time to build model in tree classifier. The classification rules algorithms such as nearest- neighbor-like algorithm (NNge) is the best algorithm due to the high accuracy and it takes lowest time to build model in classification.
A DNA microarray technology is a collection of microscopic DNA spots attached to a solid surface. Scientists use DNA microarrays to measure the expression levels of large numbers of genes simultaneously or to genotype multiple regions of a genome. Elucidating the patterns hidden in gene expression data offers a tremendous opportunity for an enhanced understanding of functional genomics. However, the large number of genes and the complexity of biological networks greatly increase the challenges of comprehending and interpreting the resulting mass of data, which often consists of millions of measurements. It is handled by clustering which reveals the natural structures and identifying the interesting patterns in the underlying data. In this paper, gene based clustering in gene expression data is proposed using Cuckoo Search with Differential Evolution (CS-DE). The experiment results are analyzed with gene expression benchmark datasets. The results show that CS-DE outperforms CS in benchmark datasets. To find the validation of the clustering results, this work is tested with one internal and one external cluster validation indexes.
Development of a method to estimate gene functions is an important task in bioinformatics. One of the approaches for the annotation is the identification of the metabolic pathway that genes are involved in. Since gene expression data reflect various intracellular phenomena, those data are considered to be related with genes’ functions. However, it has been difficult to estimate the gene function with high accuracy. It is considered that the low accuracy of the estimation is caused by the difficulty of accurately measuring a gene expression. Even though they are measured under the same condition, the gene expressions will vary usually. In this study, we proposed a feature extraction method focusing on the variability of gene expressions to estimate the genes' metabolic pathway accurately. First, we estimated the distribution of each gene expression from replicate data. Next, we calculated the similarity between all gene pairs by KL divergence, which is a method for calculating the similarity between distributions. Finally, we utilized the similarity vectors as feature vectors and trained the multiclass SVM for identifying the genes' metabolic pathway. To evaluate our developed method, we applied the method to budding yeast and trained the multiclass SVM for identifying the seven metabolic pathways. As a result, the accuracy that calculated by our developed method was higher than the one that calculated from the raw gene expression data. Thus, our developed method combined with KL divergence is useful for identifying the genes' metabolic pathway.
Analyzing DNA microarray data sets is a great challenge, which faces the bioinformaticians due to the complication of using statistical and machine learning techniques. The challenge will be doubled if the microarray data sets contain missing data, which happens regularly because these techniques cannot deal with missing data. One of the most important data analysis process on the microarray data set is feature selection. This process finds the most important genes that affect certain disease. In this paper, we introduce a technique for imputing the missing data in microarray data sets while performing feature selection.
Array-based gene expression analysis is a powerful tool to profile expression of genes and to generate information on therapeutic effects of new anti-cancer compounds. Anti-apoptotic effect of thymoquinone was studied in MCF7 breast cancer cell line using gene expression profiling with cDNA microarray. The purity and yield of RNA samples were determined using RNeasyPlus Mini kit. The Agilent RNA 6000 NanoLabChip kit evaluated the quantity of the RNA samples. AffinityScript RT oligo-dT promoter primer was used to generate cDNA strands. T7 RNA polymerase was used to convert cDNA to cRNA. The cRNA samples and human universal reference RNA were labelled with Cy-3-CTP and Cy-5-CTP, respectively. Feature Extraction and GeneSpring softwares analysed the data. The single experiment analysis revealed involvement of 64 pathways with up-regulated genes and 78 pathways with downregulated genes. The MAPK and p38-MAPK pathways were inhibited due to the up-regulation of PTPRR gene. The inhibition of p38-MAPK suggested up-regulation of TGF-ß pathway. Inhibition of p38-MAPK caused up-regulation of TP53 and down-regulation of Bcl2 genes indicating involvement of intrinsic apoptotic pathway. Down-regulation of CARD16 gene as an adaptor molecule regulated CASP1 and suggested necrosis-like programmed cell death and involvement of caspase in apoptosis. Furthermore, down-regulation of GPCR, EGF-EGFR signalling pathways suggested reduction of ER. Involvement of AhR pathway which control cytochrome P450 and glucuronidation pathways showed metabolism of Thymoquinone. The findings showed differential expression of several genes in apoptosis pathways with thymoquinone treatment in estrogen receptor-positive breast cancer cells.
Microarray technology is universally used in the study of disease diagnosis using gene expression levels. The main shortcoming of gene expression data is that it includes thousands of genes and a small number of samples. Abundant methods and techniques have been proposed for tumor classification using microarray gene expression data. Feature or gene selection methods can be used to mine the genes that directly involve in the classification and to eliminate irrelevant genes. In this paper statistical measures like T-Statistics, Signal-to-Noise Ratio (SNR) and F-Statistics are used to rank the genes. The ranked genes are used for further classification. Particle Swarm Optimization (PSO) algorithm and Shuffled Frog Leaping (SFL) algorithm are used to find the significant genes from the top-m ranked genes. The Naïve Bayes Classifier (NBC) is used to classify the samples based on the significant genes. The proposed work is applied on Lung and Ovarian datasets. The experimental results show that the proposed method achieves 100% accuracy in all the three datasets and the results are compared with previous works.
Tumor classification is a key area of research in the field of bioinformatics. Microarray technology is commonly used in the study of disease diagnosis using gene expression levels. The main drawback of gene expression data is that it contains thousands of genes and a very few samples. Feature selection methods are used to select the informative genes from the microarray. These methods considerably improve the classification accuracy. In the proposed method, Genetic Algorithm (GA) is used for effective feature selection. Informative genes are identified based on the T-Statistics, Signal-to-Noise Ratio (SNR) and F-Test values. The initial candidate solutions of GA are obtained from top-m informative genes. The classification accuracy of k-Nearest Neighbor (kNN) method is used as the fitness function for GA. In this work, kNN and Support Vector Machine (SVM) are used as the classifiers. The experimental results show that the proposed work is suitable for effective feature selection. With the help of the selected genes, GA-kNN method achieves 100% accuracy in 4 datasets and GA-SVM method achieves in 5 out of 10 datasets. The GA with kNN and SVM methods are demonstrated to be an accurate method for microarray based tumor classification.
A series of microarray experiments produces observations of differential expression for thousands of genes across multiple conditions. Principal component analysis(PCA) has been widely used in multivariate data analysis to reduce the dimensionality of the data in order to simplify subsequent analysis and allow for summarization of the data in a parsimonious manner. PCA, which can be implemented via a singular value decomposition(SVD), is useful for analysis of microarray data. For application of PCA using SVD we use the DNA microarray data for the small round blue cell tumors(SRBCT) of childhood by Khan et al.(2001). To decide the number of components which account for sufficient amount of information we draw scree plot. Biplot, a graphic display associated with PCA, reveals important features that exhibit relationship between variables and also the relationship of variables with observations.
Cancers could normally be marked by a number of differentially expressed genes which show enormous potential as biomarkers for a certain disease. Recent years, cancer classification based on the investigation of gene expression profiles derived by high-throughput microarrays has widely been used. The selection of discriminative genes is, therefore, an essential preprocess step in carcinogenesis studies. In this paper, we have proposed a novel gene selector using information-theoretic measures for biological discovery. This multivariate filter is a four-stage framework through the analyses of feature relevance, feature interdependence, feature redundancy-dependence and subset rankings, and having been examined on the colon cancer data set. Our experimental result show that the proposed method outperformed other information theorem based filters in all aspect of classification errors and classification performance.
In molecular biology, microarray technology is widely and successfully utilized to efficiently measure gene activity. If working with less studied organisms, methods to design custom-made microarray probes are available. One design criterion is to select probes with minimal melting temperature variances thus ensuring similar hybridization properties. If the microarray application focuses on the investigation of metabolic pathways, it is not necessary to cover the whole genome. It is more efficient to cover each metabolic pathway with a limited number of genes. Firstly, an approach is presented which minimizes the overall melting temperature variance of selected probes for all genes of interest. Secondly, the approach is extended to include the additional constraints of covering all pathways with a limited number of genes while minimizing the overall variance. The new optimization problem is solved by a bottom-up programming approach which reduces the complexity to make it computationally feasible. The new method is exemplary applied for the selection of microarray probes in order to cover all fungal secondary metabolite gene clusters for Aspergillus terreus.
Wavelet neural networks (WNNs) have emerged as a vital alternative to the vastly studied multilayer perceptrons (MLPs) since its first implementation. In this paper, we applied various clustering algorithms, namely, K-means (KM), Fuzzy C-means (FCM), symmetry-based K-means (SBKM), symmetry-based Fuzzy C-means (SBFCM) and modified point symmetry-based K-means (MPKM) clustering algorithms in choosing the translation parameter of a WNN. These modified WNNs are further applied to the heterogeneous cancer classification using benchmark microarray data and were compared against the conventional WNN with random initialization method. Experimental results showed that a WNN classifier with the MPKM algorithm is more precise than the conventional WNN as well as the WNNs with other clustering algorithms.
We describe a novel method for removing noise (in wavelet domain) of unknown variance from microarrays. The method is based on the following procedure: We apply 1) Bidimentional Discrete Wavelet Transform (DWT-2D) to the Noisy Microarray, 2) scaling and rounding to the coefficients of the highest subbands (to obtain integer and positive coefficients), 3) bit-slicing to the new highest subbands (to obtain bit-planes), 4) then we apply the Systholic Boolean Orthonormalizer Network (SBON) to the input bit-plane set and we obtain two orthonormal otput bit-plane sets (in a Boolean sense), we project a set on the other one, by means of an AND operation, and then, 5) we apply re-assembling, and, 6) rescaling. Finally, 7) we apply Inverse DWT-2D and reconstruct a microarray from the modified wavelet coefficients. Denoising results compare favorably to the most of methods in use at the moment.
DNA microarrays allow the measurement of expression levels for a large number of genes, perhaps all genes of an organism, within a number of different experimental samples. It is very much important to extract biologically meaningful information from this huge amount of expression data to know the current state of the cell because most cellular processes are regulated by changes in gene expression. Association rule mining techniques are helpful to find association relationship between genes. Numerous association rule mining algorithms have been developed to analyze and associate this huge amount of gene expression data. This paper focuses on some of the popular association rule mining algorithms developed to analyze gene expression data.
The goal of Gene Expression Analysis is to understand the processes that underlie the regulatory networks and pathways controlling inter-cellular and intra-cellular activities. In recent times microarray datasets are extensively used for this purpose. The scope of such analysis has broadened in recent times towards reconstruction of gene networks and other holistic approaches of Systems Biology. Evolutionary methods are proving to be successful in such problems and a number of such methods have been proposed. However all these methods are based on processing of genotypic information. Towards this end, there is a need to develop evolutionary methods that address phenotypic interactions together with genotypic interactions. We present a novel evolutionary approach, called Phenomic algorithm, wherein the focus is on phenotypic interaction. We use the expression profiles of genes to model the interactions between them at the phenotypic level. We apply this algorithm to the yeast sporulation dataset and show that the algorithm can identify gene networks with relative ease.
Developing an accurate classifier for high dimensional microarray datasets is a challenging task due to availability of small sample size. Therefore, it is important to determine a set of relevant genes that classify the data well. Traditionally, gene selection method often selects the top ranked genes according to their discriminatory power. Often these genes are correlated with each other resulting in redundancy. In this paper, we have proposed a hybrid method using feature ranking and wrapper method (Genetic Algorithm with multiclass SVM) to identify a set of relevant genes that classify the data more accurately. A new fitness function for genetic algorithm is defined that focuses on selecting the smallest set of genes that provides maximum accuracy. Experiments have been carried on four well-known datasets1. The proposed method provides better results in comparison to the results found in the literature in terms of both classification accuracy and number of genes selected.