Predictive Analysis for Big Data: Extension of Classification and Regression Trees Algorithm
Since its inception, predictive analysis has revolutionized the IT industry through its robustness and decision-making facilities. It involves applying a set of data processing techniques and algorithms to create predictive models. Its principle is to find relationships between the explanatory variables and the predicted variables: past occurrences are exploited to predict and derive the unknown outcome. With the advent of big data, many studies have suggested using predictive analytics to process and analyze big data. Nevertheless, they have been curbed by the limits of classical predictive analysis methods when faced with large amounts of data. In fact, because of its volume, its nature (semi-structured or unstructured) and its variety, big data cannot be analyzed efficiently with classical methods of predictive analysis. The authors attribute this weakness to the fact that predictive analysis algorithms do not allow the parallelization and distribution of calculation. In this paper, we propose to extend the predictive analysis algorithm Classification And Regression Trees (CART) in order to adapt it to big data analysis. The major changes to this algorithm are presented, and a version of the extended algorithm is then defined to make it applicable to huge quantities of data.
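The abstract does not say how the extended CART distributes its computation. As a minimal sketch of one common approach (an assumption, not necessarily the authors' method), the search for a split can be parallelized by counting class labels independently on each data partition and merging the counts before computing the Gini impurity. All function names and the toy data below are hypothetical.

```python
from collections import Counter

def gini(counts):
    """Gini impurity of a mapping from class label to count."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in counts.values())

def partition_counts(rows, threshold):
    """Map step: class counts on each side of `x <= threshold` for one partition."""
    left, right = Counter(), Counter()
    for x, label in rows:
        (left if x <= threshold else right)[label] += 1
    return left, right

def split_impurity(partitions, threshold):
    """Reduce step: merge per-partition counts, then return the weighted Gini impurity."""
    left, right = Counter(), Counter()
    for rows in partitions:  # each call could run on a different worker
        part_left, part_right = partition_counts(rows, threshold)
        left.update(part_left)
        right.update(part_right)
    n_left, n_right = sum(left.values()), sum(right.values())
    return (n_left * gini(left) + n_right * gini(right)) / (n_left + n_right)

def best_split(partitions, thresholds):
    """Pick the candidate threshold with the lowest weighted impurity."""
    return min(thresholds, key=lambda t: split_impurity(partitions, t))

# Toy data (invented): two partitions of (feature value, class label) rows.
partitions = [
    [(1.0, "a"), (2.0, "a"), (8.0, "b")],
    [(3.0, "a"), (7.0, "b"), (9.0, "b")],
]
best = best_split(partitions, [2.0, 3.0, 7.0])
```

Only the small count tables travel between workers, so the raw rows never have to be gathered in one place; this is what makes the split search distributable.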
Microclimate Variations in Rio de Janeiro Related to Massive Public Transportation
Urban public transportation in Rio de Janeiro is based on diesel-powered bus lines and four limited metro lines that serve only some neighborhoods. This work presents an infrastructure built to better understand microclimate variations related to massive urban transportation in some specific areas of the city. Sensor nodes with a small analytics capacity provide environmental information to the population or to public services. The analysis of data collected from a few small sensors positioned near some heavy-traffic streets shows the harmful impact of poor bus route planning.
Real-Time Data Stream Partitioning over a Sliding Window in Real-Time Spatial Big Data
In recent years, real-time spatial applications, like
location-aware services and traffic monitoring, have become more
and more important. Such applications result in dynamic environments
where data as well as queries are continuously moving. As a result,
there is a tremendous amount of real-time spatial data generated
every day. The growth of the data volume seems to outspeed the
advance of our computing infrastructure. For instance, in real-time
spatial Big Data, users expect to receive the results of each query
within a short time period without taking into account the load
of the system. But with a huge amount of real-time spatial data
generated, the system performance degrades rapidly especially in
overload situations. To solve this problem, we propose the use of
data partitioning as an optimization technique. Traditional horizontal
and vertical partitioning can increase the performance of the system
and simplify data management. But they remain insufficient for
real-time spatial Big data; they can’t deal with real-time and
stream queries efficiently. Thus, in this paper, we propose a novel
data partitioning approach for real-time spatial Big data named
VPA-RTSBD (Vertical Partitioning Approach for Real-Time Spatial
Big data). This contribution is an implementation of the Matching
algorithm for traditional vertical partitioning. First, we find the
optimal attribute sequence using the Matching algorithm. Then,
we propose a new cost model for database partitioning that keeps
the amount of data in each partition balanced and provides
parallel execution guarantees for the most frequent
queries. VPA-RTSBD aims to obtain a real-time partitioning scheme
and deals with stream data. It improves the performance of query
execution by maximizing the degree of parallel execution. This improves
QoS (Quality of Service) in real-time spatial Big Data,
especially with a huge volume of stream data. The performance of
our contribution is evaluated via simulation experiments. The results
show that the proposed algorithm is both efficient and scalable, and
that it outperforms comparable algorithms.
A Study of the Adaptive Reuse for School Land Use Strategy: An Application of the Analytic Network Process and Big Data
With today's widespread and rapidly advancing information technology, collecting and analyzing big data sets is no longer a major obstacle. We can now use relevant big data not only to analyze and simulate the likely state of urban development in the near future, but also to give government units and decision-makers a more comprehensive and reasonable basis for policy implementation. In this research, we take Taipei City as the research scope and use relevant big data variables (e.g., population, facility utilization and related social policy ratings) together with the Analytic Network Process (ANP) approach to study in depth the possible reduction of land use by primary and secondary schools in Taipei City. In addition to enhancing urban activity through better utilization of urban public facilities, the final results of this research could help improve the efficiency of urban land use in the future. Furthermore, the assessment model and research framework established in this research provide a good reference for future land use and adaptive reuse strategies for schools and other public facilities.
Integration of Big Data to Predict Transportation for Smart Cities
Intelligent transportation systems are essential to building smarter cities. Machine learning based transportation prediction is a highly promising approach because it makes otherwise invisible aspects visible. In this context, this research aims to build a prototype model that predicts the state of a transportation network using big data and machine learning technology. Specifically, among urban transportation systems, this research focuses on the bus system. The research problem is that the existing headway model cannot respond to dynamic transportation conditions, so bus delays often occur. To overcome this problem, a prediction model is presented that finds patterns of bus delay by applying machine learning to the following data sets: traffic, weather, and bus status. This research presents a flexible headway model to predict bus delay and analyzes the result. The prototype model is built from real-time bus data. The data are gathered through public data portals and real-time Application Program Interfaces (APIs) provided by the government. These data are the fundamental resources for organizing interval pattern models of bus operations in terms of traffic environment factors (road speeds, station conditions, weather, and real-time bus operating information). The prototype model was designed with a machine learning tool (RapidMiner Studio) and tested for bus delay prediction. This research presents experiments to increase the prediction accuracy for bus headway by analyzing urban big data. Big data analysis is important for predicting the future and finding correlations by processing huge amounts of data. Therefore, based on this analysis method, this research demonstrates an effective use of machine learning and urban big data to understand urban dynamics.
Exploring the Activity Fabric of an Intelligent Environment with Hierarchical Hidden Markov Theory
The Internet of Things (IoT) was designed for widespread convenience. With the smart tag and the sensing network, a large quantity of dynamic information is immediately presented in the IoT. Through the internal communication and interaction, meaningful objects provide real-time services for users. Therefore, the service with appropriate decision-making has become an essential issue. Based on the science of human behavior, this study employed the environment model to record the time sequences and locations of different behaviors and adopted the probability module of the hierarchical Hidden Markov Model for the inference. The statistical analysis was conducted to achieve the following objectives: First, define user behaviors and predict the user behavior routes with the environment model to analyze user purposes. Second, construct the hierarchical Hidden Markov Model according to the logic framework, and establish the sequential intensity among behaviors to get acquainted with the use and activity fabric of the intelligent environment. Third, establish the intensity of the relation between the probability of objects’ being used and the objects. The indicator can describe the possible limitations of the mechanism. As the process is recorded in the information of the system created in this study, these data can be reused to adjust the procedure of intelligent design services.
Urban Big Data: An Experimental Approach to Building-Value Estimation Using Web-Based Data
Current real-estate value estimation, difficult for laymen, is usually performed by specialists. This paper presents an automated estimation process based on big data and machine-learning technology that calculates influences of building conditions on real-estate price measurement. The present study analyzed actual building sales sample data for Nonhyeon-dong, Gangnam-gu, Seoul, Korea, measuring the major influencing factors among the various building conditions. Further to that analysis, a prediction model was established and applied using RapidMiner Studio, a graphical user interface (GUI)-based tool for derivation of machine-learning prototypes. The prediction model is formulated by reference to previous examples. When new examples are applied, it analyses and predicts accordingly. The analysis process discerns the crucial factors affecting price increases by calculation of weighted values. The model was verified, and its accuracy determined, by comparing its predicted values with actual price increases.
Forthcoming Big Data on Smart Buildings and Cities: An Experimental Study on Correlations among Urban Data
Cities are complex systems of diverse and inter-tangled activities. These activities and their complex interrelationships create diverse urban phenomena. And such urban phenomena have considerable influences on the lives of citizens. This research aimed to develop a method to reveal the causes and effects among diverse urban elements in order to enable better understanding of urban activities and, therefrom, to make better urban planning strategies. Specifically, this study was conducted to solve a data-recommendation problem found on a Korean public data homepage. First, a correlation analysis was conducted to find the correlations among random urban data. Then, based on the results of that correlation analysis, the weighted data network of each urban data was provided to people. It is expected that the weights of urban data thereby obtained will provide us with insights into cities and show us how diverse urban activities influence each other and induce feedback.
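The two steps described above (pairwise correlation, then a weighted data network) can be sketched as follows. This is an illustrative sketch only: the abstract does not name the correlation measure, so Pearson's coefficient is an assumption, as are the threshold `min_abs`, the function names and the invented series. Note that a correlation weight alone does not establish cause and effect.

```python
from itertools import combinations
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation of two equal-length series (assumes non-zero variance)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

def correlation_network(series, min_abs=0.5):
    """Return {(name_a, name_b): r} for every pair with |r| >= min_abs."""
    edges = {}
    for a, b in combinations(sorted(series), 2):
        r = pearson(series[a], series[b])
        if abs(r) >= min_abs:
            edges[(a, b)] = r
    return edges

# Invented example series; real inputs would come from the public data portal.
series = {
    "bus_riders": [10, 20, 30, 40],
    "subway_riders": [12, 22, 32, 42],
    "rainfall": [5, 1, 4, 2],
}
edges = correlation_network(series)
```

Each surviving edge weight can then be used to recommend related datasets to a user who is viewing one of the two endpoints.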
Building a Scalable Telemetry Based Multiclass Predictive Maintenance Model in R
Many organizations are faced with the challenge of how to analyze and build Machine Learning models using their sensitive telemetry data. In this paper, we discuss how users can leverage the power of R without having to move their big data around as well as a cloud based solution for organizations willing to host their data in the cloud. By using ScaleR technology to benefit from parallelization and remote computing or R Services on premise or in the cloud, users can leverage the power of R at scale without having to move their data around.
Exploring Influence Range of Tainan City Using Electronic Toll Collection Big Data
Big data has attracted a lot of attention in many fields as a way to analyze research issues based on large volumes of data. Electronic Toll Collection (ETC) is one of the Intelligent Transportation System (ITS) applications in Taiwan, used to record the starting point, end point, distance and travel time of vehicles on the national freeways. This study takes advantage of ETC big data, combined with urban planning theory, to explore various phenomena of inter-city transportation activities. ETC data, part of the government's open data, is voluminous, complete and quickly updated. Living areas have traditionally been delimited by location, population, area and subjective consciousness. However, these factors cannot appropriately reflect people's movement paths in daily life. In this study, the concept of "Living Area" is replaced by "Influence Range" to show how movement varies with time and with the purposes of activities. This study uses data mining with Python and Excel, and visualizes the number of trips with GIS, to explore the influence range of Tainan City and the purposes of trips, and to discuss the living area as currently delimited. It sets up a dialogue between the concepts of "Central Place Theory" and "Living Area", presents a new point of view, and integrates the application of big data, urban planning and transportation. The findings will be valuable for resource allocation and land apportionment in spatial planning.
A Proposal for U-City (Smart City) Service Method Using Real-Time Digital Map
Recently, technologies based on three-dimensional (3D) space information are being developed and quality of life is improving as a result. Research on real-time digital map (RDM) is being conducted now to provide 3D space information. RDM is a service that creates and supplies 3D space information in real time based on location/shape detection. Research subjects on RDM include the construction of 3D space information with matching image data, complementing the weaknesses of image acquisition using multi-source data, and data collection methods using big data. Using RDM will be effective for space analysis using 3D space information in a U-City and for other space information utilization technologies.
Big Data: Concepts, Technologies and Applications in the Public Sector
Big Data (BD) is associated with a new generation of technologies and architectures which can harness the value of extremely large volumes of very varied data through real time processing and analysis. It involves changes in (1) data types, (2) accumulation speed, and (3) data volume. This paper presents the main concepts related to the BD paradigm, and introduces architectures and technologies for BD and BD sets. The integration of BD with the Hadoop Framework is also underlined. BD has attracted a lot of attention in the public sector due to the newly emerging technologies that allow the availability of network access. The volume of different types of data has exponentially increased. Some applications of BD in the public sector in Romania are briefly presented.
Opening up Government Datasets for Big Data Analysis to Support Policy Decisions
Policy makers are increasingly looking to make evidence-based decisions. Evidence-based decisions have historically relied on the rigorous methodologies of empirical studies by research institutes, as well as on less reliable immediate surveys and polls, often with limited sample sizes. As we move into the era of Big Data analytics, policy makers are looking to different methodologies to deliver reliable empirics in real time. The question is not why these people did this for the last 10 years, but why these people are doing this now and, if this is undesirable, how we can have an impact to promote change immediately. Big data analytics rely heavily on government data that has been released into the public domain. The open data movement promises greater productivity and more efficient delivery of services; however, Australian government agencies remain reluctant to release their data to the general public. This paper considers the barriers to releasing government data as open data, and how these barriers might be overcome.
Comparison of Different k-NN Models for Speed Prediction in an Urban Traffic Network
We consider a database that records average traffic speeds measured at five-minute intervals for all the links in the traffic network of a metropolitan city. While learning models from this data that can predict future traffic speed would benefit applications such as car navigation systems, building predictive models for every link becomes a nontrivial job if the number of links in the network is huge. An advantage of adopting k-nearest neighbor (k-NN) as the predictive model is that it does not require any explicit model building. However, k-NN takes a long time to make a prediction because it needs to search for the k nearest neighbors in the database at prediction time. In this paper, we investigate how much we can speed up k-NN in making traffic speed predictions by reducing the amount of data to be searched, without a significant sacrifice of prediction accuracy. The rationale is that it is better to look only at recent data, because traffic patterns not only repeat daily or weekly but also change over time. In our experiments, we build several k-NN models employing different sets of features, namely the current and past traffic speeds of the target link and of its upstream and downstream neighbor links. The performance of these models is compared by measuring the average prediction accuracy and the average time taken to make a prediction using various amounts of data.
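The idea of searching only recent data can be illustrated with a small sketch. It assumes Euclidean distance, a simple mean of the neighbors' speeds, and a `window` parameter that keeps the most recent records; these choices, the function name and the toy records are assumptions, since the abstract does not specify them.

```python
from math import dist  # Euclidean distance, Python 3.8+

def knn_predict(history, query, k=2, window=None):
    """Predict the next speed as the mean over the k nearest stored records.

    history: list of (features, next_speed) ordered from oldest to newest,
             where features holds current and past speeds of the target
             link and its neighbor links.
    window:  if given (a positive integer), search only the newest records.
    """
    recent = history if window is None else history[-window:]
    nearest = sorted(recent, key=lambda record: dist(record[0], query))[:k]
    return sum(speed for _, speed in nearest) / len(nearest)

# Invented records: (speed now, speed five minutes ago) -> speed five minutes later.
history = [
    ((60.0, 58.0), 59.0),  # oldest
    ((30.0, 32.0), 31.0),
    ((50.0, 48.0), 45.0),
    ((52.0, 50.0), 47.0),  # newest
]
query = (59.0, 57.0)
pred_all = knn_predict(history, query, k=1)               # searches every record
pred_recent = knn_predict(history, query, k=1, window=2)  # searches the two newest
```

A smaller window shortens the search at prediction time, but, as the toy data shows, it can also change which neighbors are found; the trade-off between the two is what the experiments in the abstract measure.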
Visual Text Analytics Technologies for Real-Time Big Data: Chronological Evolution and Issues
New approaches to analyzing and visualizing data streams in real time are important for enabling decision makers to make prompt decisions. Financial market trading and surveillance, large-scale emergency response and crowd control are example scenarios that require real-time analytics and data visualization. This situation has led to the development of techniques and tools that support humans in analyzing the source data. With the emergence of Big Data and social media, new techniques and tools are required to process streaming data. Today, a range of tools that implement some of these functionalities is available. In this paper, we present a chronological evaluation of the evolution of technologies that support real-time analytics and visualization of data streams. Based on research papers published from 2002 to 2014, we gathered general information, main techniques, challenges and open issues. The techniques for streaming text visualization are identified in chronological order based on the Text Visualization Browser. This paper aims to review the evolution of streaming text visualization techniques and tools, and to discuss the problems and challenges of each identified tool.
Mining Big Data in Telecommunications Industry: Challenges, Techniques, and Revenue Opportunity
Mining big data represents a big challenge nowadays. Many studies are concerned with mining massive amounts of data and big data streams. Mining big data faces many challenges, including scalability, speed, heterogeneity, accuracy, provenance and privacy. In the telecommunications industry, mining big data is like mining for gold: it represents a big opportunity for maximizing the revenue streams in this industry. This paper discusses the characteristics of big data (volume, variety, velocity and veracity), data mining techniques and tools for handling very large data sets, mining big data in telecommunications, and the benefits and opportunities gained from it.
A Methodology for Investigating Public Opinion Using Multilevel Text Analysis
Recently, many users have begun to frequently share
their opinions on diverse issues using various social media. Therefore,
numerous governments have attempted to establish or improve
national policies according to the public opinions captured from
various social media. In this paper, we point out several limitations of
the traditional approaches to analyzing public opinion on science and
technology and provide an alternative methodology to overcome these
limitations. First, we distinguish between the science and technology
analysis phase and the social issue analysis phase to reflect the fact that
public opinion can be formed only when a certain science and
technology is applied to a specific social issue. Next, we successively
apply a start list and a stop list to acquire clarified and interesting
results. Finally, to identify the most appropriate documents that fit
with a given subject, we develop a new logical filter concept that
consists of not only mere keywords but also a logical relationship
among the keywords. This study then analyzes the possibilities for the
practical use of the proposed methodology through its application to
discover core issues and public opinions from 1,700,886 documents
comprising SNS, blogs, news, and discussions.
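The stop list, start list and logical filter described above can be sketched as follows. The abstract gives no implementation details, so the rule representation (nested tuples with "and", "or" and "not"), the tokenizer and the sample documents are all hypothetical.

```python
def tokenize(text):
    """Lowercase whitespace tokenizer that strips surrounding punctuation."""
    return [word.strip(".,!?;:\"'()").lower() for word in text.split()]

def apply_lists(tokens, start_list=None, stop_list=frozenset()):
    """Drop stop-list terms; if a start list is given, keep only its terms."""
    kept = [t for t in tokens if t and t not in stop_list]
    if start_list is not None:
        kept = [t for t in kept if t in start_list]
    return kept

def matches(tokens, rule):
    """Evaluate a rule: a keyword string or a tuple (operator, *subrules)."""
    if isinstance(rule, str):
        return rule in tokens
    op, *args = rule
    if op == "and":
        return all(matches(tokens, a) for a in args)
    if op == "or":
        return any(matches(tokens, a) for a in args)
    if op == "not":
        return not matches(tokens, args[0])
    raise ValueError(f"unknown operator: {op}")

# Invented documents and rule: nuclear AND (safety OR accident) AND NOT weapon.
docs = [
    "Nuclear power plant safety is debated.",
    "Nuclear weapon safety treaty signed.",
    "Solar power is cheap.",
]
rule = ("and", "nuclear", ("or", "safety", "accident"), ("not", "weapon"))
selected = [d for d in docs if matches(set(tokenize(d)), rule)]
terms = apply_lists(tokenize(docs[0]), start_list={"nuclear", "safety"}, stop_list={"is"})
```

The second document contains both keywords of interest but is rejected by the "not" clause, which a plain keyword match could not express.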
A Methodology for Automatic Diversification of Document Categories
Recently, numerous documents including large
volumes of unstructured data and text have been created because of the
rapid increase in the use of social media and the Internet. Usually,
these documents are categorized for the convenience of users. Because
the accuracy of manual categorization is not guaranteed, and such
categorization requires a large amount of time and incurs huge costs,
many studies on automatic categorization have been conducted to help
mitigate the limitations of manual categorization. Unfortunately, most
of these methods cannot be applied to categorize complex documents
with multiple topics because they work on the assumption that
individual documents can be categorized into single categories only.
Therefore, to overcome this limitation, some studies have attempted to
categorize each document into multiple categories. However, the
learning process employed in these studies involves training using a
multi-categorized document set. These methods therefore cannot be
applied to the multi-categorization of most documents unless
multi-categorized training sets using traditional multi-categorization
algorithms are provided. To overcome this limitation, in this study, we
review our novel methodology for extending the category of a
single-categorized document to multiple categories, and then
introduce a survey-based verification scenario for estimating the
accuracy of our automatic categorization methodology.
A Simple User Administration View of Computing Clusters
In this paper, a very simple and effective user
administration view of computing cluster systems is implemented in
order to provide user-friendly configuration and monitoring of
distributed application executions. The user view, the administrator
view, and an internal control module create an illusionary
management environment for better system usability. The
architecture, properties, performance, and a comparison with other
software for cluster management are briefly discussed.
Agile Methodology for Modeling and Design of Data Warehouses -AM4DW-
Organizations hold structured and unstructured information in different formats, sources, and systems. Part of it comes from ERP systems under OLTP processing that support the information system. At the OLAP processing level, however, these organizations show some deficiencies. Part of the problem lies in a lack of interest in extracting knowledge from their data sources, as well as in the absence of the operational capabilities needed to tackle this kind of project. Data warehouses and their applications are considered non-proprietary tools that are of great interest to business intelligence, since they are the base repositories for creating models or patterns (behavior of customers, suppliers, products, social networks and genomics) and they facilitate corporate decision making and research. This paper presents a simple, structured methodology inspired by agile development models such as Scrum, XP and AUP, as well as by object-relational models, spatial data models, and the baseline of data modeling under UML and Big Data. In this way, it seeks to deliver an agile methodology for developing data warehouses that is simple and easy to apply. The methodology naturally takes into account the application of processes for the corresponding information analysis, visualization and data mining, particularly for generating patterns and deriving models from the structured fact objects.
Wireless Transmission of Big Data Using Novel Secure Algorithm
This paper presents a novel algorithm for secure,
reliable and flexible transmission of big data in two hop wireless
networks using cooperative jamming scheme. Two hop wireless
networks consist of source, relay and destination nodes. Big data has
to be transmitted from source to relay and from relay to destination by
deploying security in the physical layer. The cooperative jamming scheme
makes the transmission of big data more secure by
protecting it from eavesdroppers and malicious nodes of unknown
location. The novel algorithm, which ensures secure and energy-balanced
transmission of big data, includes selecting the data transmitting
region, segmenting the selected region, determining the probability ratio
for each node (capture node, non-capture node and eavesdropper node) in
every segment, and evaluating the probability using binary-based
evaluation. If the transmission is secure, resume with the two-hop
transmission of big data, otherwise prevent the attackers by
cooperative jamming scheme and transmit the data in two-hop
Natural Language News Generation from Big Data
In this paper, we introduce an NLG application for the automatic creation of ready-to-publish texts from big data. The resulting fully automatically generated news stories closely resemble the style in which a human writer would draw up such a story. Topics include soccer games, stock exchange market reports, and weather forecasts. Each generated text is unique. Ready-to-publish stories written by a computer application can help humans to quickly grasp the outcomes of big data analyses, save journalists time-consuming pre-formulations, and cater to rather small audiences by offering stories that would otherwise not exist.
Applications of Big Data in Education
Big Data and analytics have gained a huge momentum
in recent years. Big Data feeds into the field of Learning Analytics
(LA) that may allow academic institutions to better understand the
learners’ needs and proactively address them. Hence, it is important
to have an understanding of Big Data and its applications. The
purpose of this descriptive paper is to provide an overview of Big
Data, the technologies used in Big Data, and some of the applications
of Big Data in education. Additionally, it discusses some of the
concerns related to Big Data and current research trends. While Big
Data can provide big benefits, it is important that institutions
understand their own needs, infrastructure, resources, and limitations
before jumping on the Big Data bandwagon.
An In-Depth Analysis of Open Data Portals as an Emerging Public E-Service
Governments collect and produce large amounts of
data. Increasingly, governments worldwide have started to implement
open data initiatives and also launch open data portals to enable the
release of these data in open and reusable formats. Therefore, a large
number of open data repositories, catalogues and portals have been
emerging in the world. The greater availability of interoperable and
linkable open government data catalyzes secondary use of such data,
so they can be used for building useful applications which leverage
their value, allow insight, provide access to government services, and
support transparency. The efficient development of successful open
data portals makes it necessary to evaluate them systematically, in order
to understand them better and assess the various types of value they
generate, and identify the required improvements for increasing this
value. Thus, the attention of this paper is directed particularly to the
field of open data portals. The main aim of this paper is to compare
the selected open data portals on the national level using content
analysis and propose a new evaluation framework, which further
improves the quality of these portals. It also establishes a set of
considerations for involving businesses and citizens to create e-services
and applications that leverage the datasets available from
A System for Analyzing and Eliciting Public Grievances Using Cache Enabled Big Data
The system for analyzing and eliciting public
grievances serves its main purpose to receive and process all sorts of
complaints from the public and respond to users. Because of the large
number of complaints, the data becomes big data, which is difficult to store
and process. The proposed system uses HDFS to store the big data
and uses MapReduce to process the big data. The concept of cache
was applied in the system to provide immediate response and timely
action using big data analytics. Cache enabled big data reduces the
response time of the system. The unstructured data provided by the
users are efficiently handled through the MapReduce algorithm. The
processing of complaints takes place in the order of the hierarchy of
the authority. The drawbacks of the traditional database system used
in the existing system are overcome by our system by using a Cache
enabled Hadoop Distributed File System. MapReduce framework
code has the potential to leak sensitive data through the
computation process. We propose a system that adds noise to the
output of the reduce phase to avoid signaling the presence of
sensitive data. If a complaint is not processed within the allotted time,
it is automatically forwarded to the higher authority. This
ensures that complaints are processed. A copy of the filed complaint is sent
as a digitally signed PDF document to the user's e-mail address, which serves
as proof. The system report serves as essential data when
making important decisions based on legislation.
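Adding noise to the output of the reduce phase can be sketched in plain Python, without Hadoop. The abstract does not state which noise distribution or scale the system uses, so Laplace noise and the `scale` and `seed` parameters are assumptions; the function names and sample complaints are hypothetical as well.

```python
import random
from collections import Counter

def map_phase(complaints):
    """Map step: emit a (category, 1) pair for every complaint."""
    return [(c["category"], 1) for c in complaints]

def reduce_phase(pairs):
    """Reduce step: sum the values per key."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

def laplace_noise(scale, rng):
    """Laplace(0, scale) sample: difference of two exponentials with mean `scale`."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)

def noisy_reduce(pairs, scale=1.0, seed=None):
    """Reduce, then perturb every total so exact counts are not revealed."""
    rng = random.Random(seed)
    return {key: total + laplace_noise(scale, rng)
            for key, total in reduce_phase(pairs).items()}

# Invented complaints; a real job would read them from HDFS.
complaints = [
    {"category": "water"},
    {"category": "roads"},
    {"category": "water"},
]
exact = reduce_phase(map_phase(complaints))
noisy = noisy_reduce(map_phase(complaints), scale=1.0, seed=7)
```

A larger `scale` hides individual complaints better but makes the reported totals less accurate, so the value has to be chosen per use case.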
Big Data: Big Challenges to Privacy and Data Protection
This paper seeks to analyse the benefits of big data
and, more importantly, the challenges it poses to privacy
and data protection. First, the nature of big data will be briefly
discussed before presenting the potential of big data in the present
day. Afterwards, the issue of privacy and data protection is
highlighted before discussing the challenges of implementing this
issue in big data. In conclusion, the paper will put forward the debate
on the adequacy of the existing legal framework in protecting
personal data in the era of big data.
Survey on Arabic Sentiment Analysis in Twitter
Large-scale data stream analysis has become one of
the important business and research priorities lately. Social networks
like Twitter and other micro-blogging platforms hold an enormous
amount of data that is large in volume, velocity and variety.
Extracting valuable information and trends out of these data would
aid better understanding and decision-making. Multiple analysis
techniques are deployed for English content. However, one of the
languages that produces a large amount of data over social networks
and is least analyzed is the Arabic language. This paper is a
survey on the research efforts to analyze the Arabic content in
Twitter focusing on the tools and methods used to extract the
sentiments for the Arabic content on Twitter.
Comparative Analysis of Diverse Collection of Big Data Analytics Tools
Over the past years, many efforts and
studies have been carried out to develop efficient tools for performing
various tasks in big data. Recently, big data has received a lot of
publicity, for good reasons. Because these datasets are large and
complex, they are difficult to process with traditional data
processing applications. This concern makes it all the more necessary
to produce various tools for big data. Moreover, the main aim of
big data analytics is to apply advanced analytic techniques
to very large, diverse datasets that range in size from
terabytes to zettabytes and in type from structured to
unstructured and from batch to streaming. Big data is useful for data sets
whose size or type is beyond the capability of traditional
relational databases to capture, manage and process the data
with low latency. These challenges have led to the
emergence of powerful big data tools. In this survey, a varied
collection of big data tools is illustrated and also compared with the
Big Data Strategy for Telco: Network Transformation
Big data has the potential to improve the quality of services; enable the infrastructure that businesses depend on to adapt continually and efficiently; improve the performance of employees; help organizations better understand customers; and reduce liability risks. The analytics and marketing models of fixed and mobile operators are falling short in combating churn and declining revenue per user. Big Data presents a new method to reverse this trend and improve profitability. The benefits of Big Data and next-generation networks, however, go beyond improved customer relationship management. Next-generation networks are in a prime position to monetize rich supplies of customer information, while being mindful of legal and privacy issues. As data assets are transformed into new revenue streams, they will become integral to high performance.