Data Mining
Chapter 14
Big Data
Big Data
• In 2001, Gartner defined big data as "data that contains greater variety, arriving in increasing volumes and with ever-higher velocity".
• This led to the formulation of the "three V's".
• Big data refers to an avalanche of structured and unstructured data that is endlessly flooding in from a seemingly endless variety of data sources.
• These data sets are too large to be analyzed with traditional analytical
tools and technologies but have a plethora of valuable insights hiding
underneath.
The “Vs” of
Big data
• Volume
• Velocity
• Variety
• Veracity
• Value
Volume
• To be classified as big data, the volume of the given data set must be
substantially larger than traditional data sets.
• These data sets are primarily composed of unstructured data with
limited structured and semistructured data.
• The unstructured data or the data with unknown value can be
collected from input sources such as webpages, search history, mobile
applications, and social media platforms.
• The size and customer base of a company are usually proportional to the volume of data the company acquires.
Velocity
• The speed at which data can be gathered and acted upon refers to the velocity of big data.
• Companies are increasingly using a combination of on-premise and
cloud-based servers to increase the speed of their data collection.
• The modern-day "Smart Products and Devices" require real-time access to consumer data in order to provide users with a more engaging and enhanced experience.
Variety
• Traditionally, a data set would contain a majority of structured data with a low volume of unstructured and semi-structured data.
• The advent of big data has given rise to new unstructured data types such as video, text, and audio, which require sophisticated tools and technologies to clean and process before meaningful insights can be extracted.
Veracity
• Another "V" that must be considered for big data analysis is veracity.
This refers to the "trustworthiness or the quality" of the data.
• For example, on social media platforms like "Facebook" and "Twitter", blogs and posts containing hashtags, acronyms, and all kinds of typing errors can significantly reduce the reliability and accuracy of the data sets.
Value
• Data has evolved as a currency of its own with intrinsic value.
• Just like traditional monetary currencies, the ultimate value of big data is directly proportional to the insight gathered from it.
History of Big Data
• The origin of large volumes of data can be traced back to the 1960s and 1970s
when the Third Industrial Revolution had just started to kick in, and the
development of relational databases had begun along with the construction
of data centers.
• But the concept of big data has only recently taken center stage, primarily since the arrival of free search engines like Google and Yahoo, free online entertainment services like YouTube, and social media platforms like Facebook.
• In 2005, businesses started to recognize the incredible amount of user data being generated through these platforms and services, and in the same year an open-source framework called "Hadoop" was developed to gather and analyze the large data dumps available to companies.
History of Big Data (Cont.)
• During the same period, a non-relational, distributed database model called "NoSQL" started to gain popularity due to its ability to store and extract unstructured data. "Hadoop" made it possible for companies to work with big data with ease and at a relatively low cost.
• Today, with the rise of cutting-edge technology, not only humans but also machines are generating data.
• The smart device technologies like “Internet of things” (IoT) and
“Internet of systems” (IoS) have skyrocketed the volume of big data.
History of Big Data (Cont.)
• Our everyday household objects and smart devices are connected to the Internet, able to track and record our usage patterns as well as our interactions with these products, and feed all of this data directly into the big data pool.
• The advent of machine learning technology has further increased the
volume of data generated on a daily basis.
• It is estimated that by 2020, "1.7 MB of data will be generated per second per person". As big data continues to grow, its usability still has many horizons to cross.
Importance of big data
• To gain reliable and trustworthy information from a data set, it is very important to have a complete data set, which has been made possible with the use of big data technology.
• The more data we have, the more information and details can be
extracted out of it.
• By offering a 360-degree view of a problem and its underlying solutions, big data has a very promising future.
Importance of big data
(Cont.)
• Here are some examples of the use of big data:
• Product development
• Predictive maintenance
• Customer experience
• Fraud and compliance
• Operational efficiency
• Machine learning
• Drive innovation
Product development
• Large and small e-commerce businesses are increasingly relying upon
big data to understand customer demands and expectations.
• Companies can develop predictive models to launch new products and
services by using primary characteristics of their past and existing
products and services and generating a model describing the
relationship of those characteristics with the commercial success of
those products and services.
• For example, "Procter & Gamble", a leading fast-moving consumer goods company, extensively uses big data gathered from social media websites, test markets, and focus groups in preparation for new product launches.
Predictive maintenance
• In order to predict potential mechanical and equipment failures, a large volume of unstructured data such as error messages, log entries, and machine operating temperatures must be analyzed along with available structured data such as the make and model of the equipment and its year of manufacture.
• By analyzing this big data set using the required analytical tools, companies can extend the shelf life of their equipment by preparing for scheduled maintenance ahead of time and predicting future occurrences of potential mechanical failures.
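• A minimal sketch of this idea in Python, assuming hypothetical files and columns (equipment.csv with make, model, year and a binary failed label; logs.csv with raw log entries and temperature readings): the unstructured logs are reduced to simple numeric features and combined with the structured fields to train a failure classifier.

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

equipment = pd.read_csv("equipment.csv")  # equipment_id, make, model, year, failed
logs = pd.read_csv("logs.csv")            # equipment_id, message, temperature

# Reduce the unstructured log entries to simple numeric features per machine.
log_features = logs.groupby("equipment_id").agg(
    error_count=("message", lambda m: m.str.contains("ERROR").sum()),
    avg_temp=("temperature", "mean"),
)

data = equipment.join(log_features, on="equipment_id").fillna(0)
X = data[["year", "error_count", "avg_temp"]]
y = data["failed"]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
print("Holdout accuracy:", model.score(X_test, y_test))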
Customer experience
• The smart customer is aware of all of the technological advancements
and is loyal only to the most engaging and enhanced user experience
available.
• This has triggered a race among companies to provide unique customer experiences by analyzing the data gathered from customers' interactions with the company's products and services.
• Providing personalized recommendations and offers helps reduce the customer churn rate and effectively convert prospective leads into paying customers.
Fraud and compliance
• Big data helps in identifying the data patterns and assessing historical
trends from previous fraudulent transactions to effectively detect and
prevent potentially fraudulent transactions.
• Banks, financial institutions, and online payment services like “PayPal”
are constantly monitoring and gathering customer transaction data in
an effort to prevent fraud.
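• One common realization of this idea is unsupervised anomaly detection. The sketch below flags unusual transactions with scikit-learn's IsolationForest; the (amount, hour-of-day) features and all figures are synthetic stand-ins, not any institution's actual pipeline.

import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# Synthetic (amount, hour-of-day) features for ordinary transactions.
normal = rng.normal(loc=[50, 14], scale=[20, 3], size=(1000, 2))
suspicious = np.array([[5000, 3], [7500, 4]])  # large, late-night amounts
transactions = np.vstack([normal, suspicious])

detector = IsolationForest(contamination=0.01, random_state=0)
labels = detector.fit_predict(transactions)  # -1 marks anomalies
print("Flagged transactions:\n", transactions[labels == -1])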
Operational efficiency
• With the help of big data predictive analysis, companies can learn and anticipate future demand and product trends by analyzing production capacity, customer feedback, and data pertaining to top-selling items and product returns, improving decision-making and producing products that are in line with current market trends.
Machine learning
• For a machine to be able to learn and train on its own, it requires a humongous volume of data, i.e., big data.
• A solid training set containing structured, semi-structured, and unstructured data will help the machine develop a multidimensional view of the real world and of the problem it is engineered to solve.
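• Since such training sets are frequently too large to fit in memory, many libraries support incremental ("out-of-core") learning. Below is a minimal sketch using scikit-learn's partial_fit API; the batch generator is a synthetic stand-in for a loader that would stream data from disk or a cluster.

import numpy as np
from sklearn.linear_model import SGDClassifier

model = SGDClassifier(random_state=0)
classes = np.array([0, 1])  # every label must be declared on the first call

def batches(n_batches=100, batch_size=10_000, n_features=20):
    # Synthetic stand-in for streaming a data set too big for memory.
    rng = np.random.default_rng(0)
    for _ in range(n_batches):
        X = rng.normal(size=(batch_size, n_features))
        y = (X[:, 0] + X[:, 1] > 0).astype(int)  # synthetic labels
        yield X, y

# Train one batch at a time instead of loading everything at once.
for X, y in batches():
    model.partial_fit(X, y, classes=classes)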
Drive innovation
• By studying and understanding the relationships between humans
and their electronic devices as well as the manufacturers of these
devices, companies can develop improved and innovative products by
examining current product trends and meeting customer
expectations.
Importance of big data
(Cont.)
• "The importance of big data doesn't revolve around how much data you have, but what you do with it. You can take data from any source and analyze it to find answers that enable 1) cost reductions, 2) time reductions, 3) new product development and optimized offerings, and 4) smart decision making". - SAS
The functioning of big data
• There are three important actions required to gain insights from big
data:
• Integration
• Management
• Analysis
Integration
• Traditional data integration methods such as ETL (Extract, Transform, Load) are incapable of collating data from the wide variety of unrelated sources and applications that are at the heart of big data.
• Advanced tools and technologies are required to analyze big data sets
that are exponentially larger than traditional data sets.
• By integrating big data from these disparate sources, companies are
able to analyze and extract valuable insight to grow and maintain
their businesses.
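• The toy sketch below illustrates the extract-transform-load pattern in Python; the file and column names are hypothetical, and a production pipeline would pull from many disparate sources (APIs, logs, message queues) rather than a single CSV.

import pandas as pd

def extract() -> pd.DataFrame:
    # Stand-in for pulling raw data from disparate sources.
    return pd.read_csv("raw_events.csv")

def transform(df: pd.DataFrame) -> pd.DataFrame:
    # Normalize column names and drop records missing a user id.
    df = df.rename(columns=str.lower)
    return df.dropna(subset=["user_id"])

def load(df: pd.DataFrame) -> None:
    # A real pipeline would write to a data warehouse or data lake.
    df.to_csv("events_clean.csv", index=False)

load(transform(extract()))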
Management
• Big data management can be defined as “the organization, administration,
and governance of large volumes of both structured and unstructured data”.
• Big data requires efficient and cheap storage, which can be accomplished
using servers that are on-premises, cloud-based or a combination of both.
• Companies are able to seamlessly access required data from anywhere in the world and then process this data using the required processing engines on an as-needed basis.
• The goal is to make sure the data is of high quality and can be accessed easily by the required tools and applications.
• Big data is gathered from all kinds of data sources, including social media platforms, search engine history, and call logs.
Management (Cont.)
• Big data usually contains large sets of unstructured and semi-structured data, which are stored in a variety of formats.
• To be able to process and store this complicated data, companies require more
powerful and advanced data management software beyond the traditional relational
databases and data warehouse platforms.
• New platforms are available in the market that can combine big data with the
traditional data warehouse systems in a "logical data warehousing architecture”. As
part of this effort, companies are required to make decisions on what data must be
secured for regulatory purposes and compliance, what data must be kept for future
analytical purposes and what data has no future use and can be disposed of.
• This process is called "data classification”, which allows a rapid and efficient analysis
of a subset of data to be included in the immediate decision-making process of the
company.
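• As a minimal, purely illustrative sketch of this "data classification" step, a simple rule table in Python can tag each data set with a retention category; the data set names and rules below are hypothetical.

# Minimal sketch of rule-based data classification for retention.
# Data set names and categories are illustrative placeholders.
RETENTION_RULES = {
    "transactions": "regulatory",   # must be secured for compliance
    "clickstream": "analytical",    # kept for future analysis
    "temp_cache": "disposable",     # no future use, can be disposed of
}

def classify(dataset_name: str) -> str:
    # Unknown data sets default to being kept for analysis.
    return RETENTION_RULES.get(dataset_name, "analytical")

for name in ["transactions", "temp_cache", "search_history"]:
    print(name, "->", classify(name))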
Analysis
• Once the big data has been collected and is easily accessible, it can be
analyzed using advanced analytical tools and technologies.
• This analysis will provide valuable insight and actionable information.
• Big data can be explored to make discoveries and develop data
models using artificial intelligence and machine learning algorithms.
Big Data Analytics
• The terms big data and big data analytics are often used interchangeably, owing to the fact that the inherent purpose of big data is to be analyzed.
• "Big data analytics" can be defined as a set of qualitative and
quantitative methods that can be employed to examine a large amount
of unstructured, structured, and semi-structured data to discover data
patterns and valuable hidden insights.
• Big data analytics is the science of analyzing big data to collect metrics, key performance indicators, and data trends that can easily be lost in the flood of raw data, by using machine learning algorithms and automated analytical techniques.
Big Data Analytics
(Cont.)
• The different steps involved in "big data analysis" are:
• Gathering data requirements
• Gathering the data
• Organizing and categorizing the data
• Cleaning the data
• Analyzing the data
Gathering Data Requirements
• It is important to understand what information or data needs to be
gathered to meet the business objective and goals.
• Data organization is also very critical for efficient and accurate data
analysis. Some of the categories in which the data can be organized
are gender, age, demographics, location, ethnicity, and income.
• A decision must also be made on the required data types (qualitative
and quantitative) and data values (can be numerical or
alphanumerical) to be used for the analysis.
Gathering Data
• Raw data can be collected from disparate sources such as social
media platforms, computers, cameras, other software applications,
company websites, and even third-party data providers.
• Big data analysis inherently requires large volumes of data, the majority of which is unstructured, with limited amounts of structured and semi-structured data.
Data organization and
categorization
• Depending on the company's infrastructure, data organization can be done in a simple Excel spreadsheet or using advanced tools and applications capable of processing statistical data.
• Data must be organized and categorized based on data requirements
collected in step one of the big data analysis process.
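• A minimal sketch of this step using pandas; the records and categories (age, gender, income) are hypothetical examples of the organization described above.

import pandas as pd

records = pd.DataFrame({
    "age":    [23, 35, 41, 29, 52],
    "gender": ["F", "M", "F", "M", "F"],
    "income": [48000, 62000, 75000, 51000, 83000],
})

# Bucket a continuous value into categories, then summarize per group.
records["age_group"] = pd.cut(records["age"], bins=[0, 30, 50, 120],
                              labels=["<30", "30-50", "50+"])
print(records.groupby(["age_group", "gender"], observed=True)["income"].mean())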
Cleaning the data
• To perform the big data analysis efficiently and rapidly, it is very important to make sure the data set is free of any redundancy and errors.
• Only a complete data set fulfilling the data requirements should proceed to the final analysis step.
• Preprocessing of the data is required to make sure that only high-quality data is analyzed and company resources are put to good use.
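• A minimal preprocessing sketch in pandas, assuming a hypothetical customers_raw.csv with customer_id, email, and age columns; redundant rows, incomplete records, and impossible values are removed before analysis.

import pandas as pd

df = pd.read_csv("customers_raw.csv")

df = df.drop_duplicates()                          # remove redundant rows
df = df.dropna(subset=["customer_id"])             # drop incomplete records
df["email"] = df["email"].str.strip().str.lower()  # normalize a text field
df = df[df["age"].between(0, 120)]                 # discard impossible values

df.to_csv("customers_clean.csv", index=False)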
Cleaning the data (Cont.)
• "Big data is high-volume, high-velocity and/or high-variety information assets that demand cost-effective, innovative forms of information processing that enable enhanced insight, decision making, and process automation".
• - Gartner
Analyzing the data
• Depending on the insight that is expected to be achieved by the completion of the analysis, any of the following four types of big data analytics approaches can be adopted:
• Predictive analysis
• Prescriptive analysis
• Descriptive analysis
• Diagnostic analysis
• The big data analysis can be conducted using one or more of the tools listed below:
• Hadoop – Open-source big data framework.
• Python – Programming language widely used for machine learning.
• SAS – Advanced analytical tool used primarily for big data analysis.
• Tableau – Artificial intelligence-based tool used primarily for data visualization.
• SQL – Query language used to extract data from relational databases.
• Splunk – Analytical tool used to categorize machine-generated data.
• R – Programming language used primarily for statistical computing.
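• As a small illustration, a descriptive analysis (summarizing what happened) can be done in a few lines of Python with pandas; the sales figures below are a hypothetical stand-in for real data.

import pandas as pd

sales = pd.DataFrame({
    "region":  ["North", "South", "North", "East", "South"],
    "revenue": [1200, 950, 1430, 800, 1100],
})

# Descriptive analysis: summarize and rank what has already happened.
print(sales["revenue"].describe())
print(sales.groupby("region")["revenue"].sum().sort_values(ascending=False))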
Applications of Big Data Analytics
• Big data analytics is invaluable to every business that relies on quick and agile decisions to stay competitive.
• Some of the different types of organizations that can use big data analytics
are:
• Education industry
• Healthcare
• Travel Industry
• Finance
• Manufacturing
• Retail
• Life sciences
Big Data Analysis Vs. Data
Visualization
• In the wider data community, data analysis and data visualization are
increasingly being used synonymously. Professional data analysts are
expected to be able to skillfully represent data using visual tools and
formats.
• On the other hand, new professional job positions called "data visualization expert" and "data artist" have hit the market.
• But companies still need professionals to analyze their data and extract valuable insights from it.
• As you have learned by now, data analysis or big data analysis is an "exploratory process" with defined goals and specific questions that need to be answered from a given set of big data.
• Data visualization pertains to the visual representation of data, using tools as simple as an Excel spreadsheet or as advanced as dashboards created using Tableau. Business executives are always short on time yet need to capture a whole lot of detail.
• Therefore, the data analyst is required to use effective visualizations that
can significantly lower the amount of time needed to understand the
presented data and gather valuable insights from the data.
• By developing a variety of visual presentations from the data, an analyst can
view the data from different perspectives and identify potential data trends,
outliers, gaps, and anything that stands out and warrants further analysis.
• This process is referred to as "visual analytics”.
• Some of the widely used visual representations of the data are
"dashboard reports”, "infographics”, and "data story”.
• These visual representations are considered the final deliverable of the big data analysis process, but in reality they frequently serve as a starting point for future analytical activities.
• The two seemingly distinct activities of data visualization and big data analysis are inherently related and loop into each other, each serving as both a starting point and an endpoint for the other.
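• The sketch below illustrates this loop in miniature with Python's matplotlib and hypothetical monthly figures: a quick chart surfaces an outlier that a raw table might hide, which in turn invites further analysis.

import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
revenue = [100, 105, 98, 240, 110, 115]  # hypothetical; April stands out

plt.plot(months, revenue, marker="o")
plt.title("Monthly revenue: the April spike warrants further analysis")
plt.ylabel("Revenue (thousands)")
plt.savefig("revenue_trend.png")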