Predictive Analysis for Big Data: Extension of Classification and Regression Trees Algorithm
Since its inception, predictive analysis has revolutionized the IT industry through its robustness and its support for decision making. It applies a set of data processing techniques and algorithms to build predictive models. Its principle is to find relationships between explanatory variables and predicted variables: past occurrences are exploited to predict an unknown outcome. With the advent of big data, many studies have suggested using predictive analytics to process and analyze big data. Nevertheless, these efforts have been curbed by the limits of classical predictive analysis methods when faced with large amounts of data. Because of the volume, nature (semi-structured or unstructured) and variety of big data, it cannot be analyzed efficiently with classical methods of predictive analysis. This weakness is attributed to the fact that classical predictive analysis algorithms do not allow calculations to be parallelized and distributed. In this paper, we propose to extend a predictive analysis algorithm, Classification And Regression Trees (CART), in order to adapt it to big data analysis. The major changes to the algorithm are presented, and a version of the extended algorithm is then defined so that it can be applied to very large quantities of data.
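To illustrate why the split search at the heart of CART lends itself to parallel and distributed calculation, the following is a minimal sketch and not the extended algorithm defined in this paper. It assumes a single numeric feature, the Gini impurity as the split criterion, and data already partitioned into chunks; the function names (`map_counts`, `merge_counts`, `weighted_gini`, `best_threshold`) are hypothetical. Each chunk produces class counts independently (a map step), and because counts are additive they can be summed (a reduce step) before the impurity is computed.

```python
from collections import Counter
from functools import reduce


def map_counts(chunk, threshold):
    """Map step: count class labels on each side of the split in one chunk."""
    left, right = Counter(), Counter()
    for value, label in chunk:
        (left if value <= threshold else right)[label] += 1
    return left, right


def merge_counts(a, b):
    """Reduce step: class counts are additive, so partial results just sum."""
    return a[0] + b[0], a[1] + b[1]


def gini(counts):
    """Gini impurity of a node described by its class counts."""
    n = sum(counts.values())
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def weighted_gini(chunks, threshold):
    """Impurity of a split, computed from counts merged across all chunks.

    Assumes at least one chunk and at least one record overall.
    """
    partials = (map_counts(chunk, threshold) for chunk in chunks)
    left, right = reduce(merge_counts, partials)
    n_left, n_right = sum(left.values()), sum(right.values())
    return (n_left * gini(left) + n_right * gini(right)) / (n_left + n_right)


def best_threshold(chunks, candidates):
    """Pick the candidate threshold with the lowest weighted Gini impurity."""
    return min(candidates, key=lambda t: weighted_gini(chunks, t))


# Toy data split into two chunks, as if stored on two workers.
chunks = [
    [(1.0, "a"), (2.0, "a"), (3.0, "b")],
    [(4.0, "b"), (5.0, "b"), (1.5, "a")],
]
print(best_threshold(chunks, [1.5, 2.0, 3.0, 4.0]))
```

In this sketch the map step runs sequentially; in a distributed setting each call to `map_counts` could run on the worker that holds the chunk, with only the small count tables sent over the network.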