A Hybrid Feature Selection and Deep Learning Algorithm for Cancer Disease Classification
Learning from very large datasets is a significant problem for most current data mining and machine learning algorithms. MicroRNA (miRNA) data, which describe non-coding RNA sequences, are among the important large genomic datasets. In this paper, a hybrid method for classifying miRNA data is proposed. Because of the variety of cancers and the large number of genes, analyzing miRNA datasets has been a challenging problem for researchers: the number of features is high relative to the number of samples, and the data are imbalanced. A feature selection method is used to keep the features that best distinguish the classes and to eliminate uninformative ones. Afterward, a Convolutional Neural Network (CNN) classifies the cancer types, with a Genetic Algorithm used to search for optimized CNN hyper-parameters. To speed up classification by the CNN, a Graphics Processing Unit (GPU) is recommended so that the underlying computations run in parallel. The proposed method is tested on a real-world dataset with 8,129 patients, 29 different types of tumors, and 1,046 miRNA biomarkers, taken from The Cancer Genome Atlas (TCGA) database.
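The feature selection step can be illustrated with a minimal sketch. The abstract does not state which selection criterion is used, so the Fisher-style ratio below (between-class spread divided by within-class variance) is an assumption chosen only for illustration. The function names `fisher_scores` and `select_top_k` and the toy data are hypothetical.

```python
from statistics import fmean, pvariance


def fisher_scores(X, y):
    """Return one Fisher-style score per feature (column) of X.

    X: list of samples, each a list of feature values.
    y: list of class labels, one per sample.
    """
    n_features = len(X[0])
    classes = sorted(set(y))
    scores = []
    for j in range(n_features):
        column = [row[j] for row in X]
        overall_mean = fmean(column)
        between = 0.0  # weighted spread of class means around the overall mean
        within = 0.0   # weighted variance inside each class
        for c in classes:
            values = [v for v, label in zip(column, y) if label == c]
            between += len(values) * (fmean(values) - overall_mean) ** 2
            within += len(values) * pvariance(values)
        if within > 0:
            scores.append(between / within)
        else:
            # No within-class variance: perfectly separating or constant feature.
            scores.append(float("inf") if between > 0 else 0.0)
    return scores


def select_top_k(X, y, k):
    """Indices of the k highest-scoring features, best first."""
    scores = fisher_scores(X, y)
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:k]


# Toy data (made up): feature 0 separates the classes, feature 1 does not.
X = [[1.0, 5.0], [1.2, 3.0], [0.9, 4.0], [3.0, 4.5], [3.1, 3.5], [2.9, 4.0]]
y = ["A", "A", "A", "B", "B", "B"]
top = select_top_k(X, y, k=1)
print(top)
```

In a setting like the one described, with far more features than the classifier needs, a ranking of this kind would be computed on the training split only and the top-ranked columns passed on to the CNN.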