Data Analytics Writeup
Introduction:
What is data science? What is big data? What do these terms mean, and why is it important to find out? These topics are often misunderstood, and the industries involved don’t have universally agreed-upon definitions for either term.
These are extremely important fields and concepts, and they are becoming increasingly critical. The world has never collected or stored as much data, as quickly, as it does today. In addition, the variety and volume of data are growing at an alarming rate. Data is analogous to gold in many ways: it is extraordinarily valuable and has many uses, but you often have to pan for it in order to realize its value.
There are many debates as to whether data science is a new field. Many argue that similar
practices have been used and branded as statistics, analytics, business intelligence, and so forth.
In either case, data science is a very popular and prominent term used to describe many different
data-related processes and techniques. Big data, on the other hand, is relatively new in the sense that the amount of data collected, and the challenges associated with it, continue to require new and innovative hardware and techniques for handling it.
Data Science:
Data Science is a field or domain which involves working with huge amounts of data and using them to build descriptive, predictive and prescriptive analytical models. It’s about capturing and digging into the data (building the model), analysing it (validating the model) and utilizing it (deploying the best model). It is an intersection of data and computing: a blend of computer science, business and statistics.
Big Data:
Big data is the huge, voluminous data, information and relevant statistics acquired by large organizations and ventures. Much specialized software and data storage has been created and prepared because it is difficult to process big data manually.
It is used to discover patterns and trends and to make decisions related to human behaviour and interaction with technology.
Types of Data and Data Sets:
A dataset can be one of several different types, distinguished by how the data is stored and structured.
A File-Based Dataset is a complete dataset stored within a single file. File-based datasets usually have some method of assigning data to different categories.
Folder-Based Datasets are datasets in which the folder holding the data is itself the dataset. MapInfo TAB and CSV formats are examples of folder-based datasets. Here the data is stored as a series of files; commonly, each differently named file is a feature type within the dataset.
Database Datasets are sets of data stored within a database. Generally, each different
database is a different dataset. The most common example is an Oracle database. It will
be treated the same way whether it is spatial or non-spatial. Every different table within
the database is treated as a feature type.
A Web Dataset is a collection of data stored on an Internet site. In this case, the name of the dataset is the same as the URL (Uniform Resource Locator). A Web Feature Service (WFS) server is an example of this. Web datasets usually have a number of layers, and each layer signifies a different feature type.
Generally, data sets hold information in the form of records to be used by a program running on the system. Data sets are also used to store information needed by applications or the operating system itself, such as source programs, macro libraries, or system variables or parameters. Datasets can be Structured (RDBMS, Excel, text-delimited, CSV), Semi-Structured (JSON, XML) or Unstructured (text, raw images).
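As a minimal illustration of these formats, the sketch below loads a structured CSV file and a semi-structured JSON file with pandas; the file names data.csv and data.json are hypothetical placeholders.

import json
import pandas as pd

# Structured data: rows and columns with a fixed schema.
structured = pd.read_csv("data.csv")          # hypothetical file name

# Semi-structured data: nested key-value records without a rigid schema.
with open("data.json") as f:
    records = json.load(f)
semi_structured = pd.json_normalize(records)  # flatten nested fields into columns

print(structured.head())
print(semi_structured.head())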
Data mining is an area of data science in which large data sets are thoroughly processed to identify patterns and produce suitable results. Data mining is used to find patterns, anomalies and correlations in large datasets and to make predictions using a broad range of techniques; organizations use this extracted information to increase revenue, cut costs, reduce risk, improve customer relationships, and so on. Data mining technologies include neural networks, statistical analysis, decision trees, genetic algorithms, fuzzy logic, text mining, web mining and more.
Data Visualization is the process of turning complex data into visual form so that a conclusion can be drawn at a glance, without the need to study raw or theoretical results. Applications include satellite data, research results, scientifically studied data and so on. Data visualization has seven stages: acquiring, parsing, filtering, mining, representing, refining and interacting.
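As a minimal sketch of the representing stage, assuming matplotlib and using made-up monthly values:

import matplotlib.pyplot as plt

# Made-up monthly readings, used only to illustrate a simple visual summary.
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
readings = [12.1, 13.4, 15.0, 14.2, 16.8, 18.3]

plt.plot(months, readings, marker="o")
plt.title("Average monthly reading")
plt.xlabel("Month")
plt.ylabel("Reading")
plt.show()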
Types of Data Analysis:
Descriptive and Inferential Analysis: These statistical analysis methods show "What happened?" by using past data, often presented in dashboards. Statistical analysis includes the collection, analysis, interpretation, presentation and modelling of data, applied either to a complete data set or to a sample of it. There are two categories of this type of analysis: descriptive analysis and inferential analysis.
Descriptive analysis summarizes complete data, or a sample of it, numerically. It reports the mean and deviation for continuous data, and percentages and frequencies for categorical data.
Inferential analysis analyses a sample drawn from the complete data. In this type of analysis, you can reach different conclusions from the same data by selecting different samples.
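The sketch below contrasts the two ideas on a made-up data set: the descriptive step summarizes the complete data, while the inferential step estimates the mean from random samples and shows how different samples can suggest slightly different conclusions.

import numpy as np

rng = np.random.default_rng(0)
population = rng.normal(loc=50, scale=10, size=10_000)  # made-up continuous data

# Descriptive analysis: summarize the complete data.
print("population mean:", population.mean(), "std:", population.std())

# Inferential analysis: estimate the mean from samples; different samples
# can lead to slightly different conclusions.
for _ in range(3):
    sample = rng.choice(population, size=100, replace=False)
    print("sample mean:", sample.mean())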
Diagnostic Analysis: Diagnostic analysis shows "Why did it happen?" by finding the cause behind the insights uncovered in statistical analysis. This analysis is useful for identifying behaviour patterns in data. If a new problem arises in your business process, you can look to this analysis to find similar patterns from the past, and there is a good chance that similar prescriptions will apply to the new problem.
Predictive Analysis: Predictive analysis shows "What is likely to happen?" by using previous data. The simplest example: if I bought two dresses last year based on my savings, and this year my salary doubles, then I might buy four dresses. Of course it is not that simple, because other circumstances have to be considered: clothing prices may have risen this year, or instead of dresses you may want to buy a new bike, or you may need to buy a house!
So this analysis makes predictions about future outcomes based on current or past data. Forecasting is only an estimate; its accuracy depends on how much detailed information you have and how deeply you dig into it.
Prescriptive Analysis: Prescriptive analysis combines the insights from all the previous types of analysis to determine which action to take on a current problem or decision. Most data-driven companies are utilizing prescriptive analysis because predictive and descriptive analysis alone are not enough to improve performance. Based on current situations and problems, they analyse the data and make decisions.
Artificial Intelligence and Machine Learning:
According to Stanford researcher John McCarthy, artificial intelligence is "the science and engineering of making intelligent machines, especially intelligent computer programs."
The field has a long history rooted in military science and statistics, with contributions from
philosophy, psychology, math and cognitive science. Artificial intelligence originally set out to
make computers more useful and more capable of independent reasoning.
Most historians trace the birth of AI to a Dartmouth research project in 1956 that explored topics
like problem solving and symbolic methods. In the 1960s, the US Department of Defense took
interest in this type of work and increased the focus on training computers to mimic human
reasoning.
Machine learning automates analytical model building. It uses methods from neural
networks, statistics, operations research and physics to find hidden insights in data
without being explicitly programmed where to look or what to conclude.
A neural network is a kind of machine learning inspired by the workings of the human
brain. It’s a computing system made up of interconnected units (like neurons) that
processes information by responding to external inputs, relaying information between
each unit. The process requires multiple passes at the data to find connections and derive
meaning from undefined data.
Deep learning uses huge neural networks with many layers of processing units, taking
advantage of advances in computing power and improved training techniques to learn
complex patterns in large amounts of data. Common applications include image and
speech recognition.
Computer vision relies on pattern recognition and deep learning to recognize what’s in a
picture or video. When machines can process, analyze and understand images, they can
capture images or videos in real time and interpret their surroundings.
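As a rough sketch of how a deep network for image recognition might be defined, assuming TensorFlow/Keras; the layer sizes and the 28x28 grayscale input are arbitrary choices for illustration, not a recipe from the text above.

import tensorflow as tf

# A small convolutional network for classifying 28x28 grayscale images
# into 10 classes; layer sizes are illustrative only.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(28, 28, 1)),
    tf.keras.layers.Conv2D(16, 3, activation="relu"),   # learn local image patterns
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(32, 3, activation="relu"),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(10, activation="softmax"),    # class probabilities
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(train_images, train_labels, epochs=5)  # training data not shown here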
Artificial Intelligence applies machine learning, deep learning and other techniques to solve
actual problems.
A machine learning algorithm, also called a model, is a mathematical expression that represents data in the context of a problem, often a business problem. The aim is to go from data to insight. For example, if an online retailer wants to anticipate sales for the next quarter, they might use a machine learning algorithm that predicts those sales based on past sales and other relevant data. Similarly, a windmill manufacturer might visually monitor important equipment and feed the video data through algorithms trained to identify dangerous cracks. Common families of machine learning algorithms include:
1. Regression
2. Classification
3. Clustering
4. Dimensionality Reduction
5. Ensemble Methods
6. Neural Nets and Deep Learning
Regression:
Regression methods fall within the category of supervised ML. They help to predict or explain a
particular numerical value based on a set of prior data, for example predicting the price of a
property based on previous pricing data for similar properties.
The simplest method is linear regression where we use the equation of the line (y = m*x + b) to
model a data set. We train a linear regression model with many data pairs (x, y) by calculating
the position and slope of a line that minimizes the total distance between all of the data points
and the line. In other words, we calculate the slope (m) and the y-intercept (b) for a line that best
approximates the observations in the data.
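A minimal sketch of fitting y = m*x + b with scikit-learn; the quarterly figures are made up in the spirit of the retailer example above.

import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up quarterly data: x = advertising spend, y = sales.
x = np.array([[10.0], [20.0], [30.0], [40.0], [50.0]])
y = np.array([110.0, 190.0, 310.0, 420.0, 490.0])

model = LinearRegression().fit(x, y)
print("slope m:", model.coef_[0], "intercept b:", model.intercept_)
print("predicted sales for a spend of 60:", model.predict([[60.0]])[0])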
Classification:
Another class of supervised ML, classification methods predict or explain a class value. For
example, they can help predict whether or not an online customer will buy a product. The output
can be yes or no: buyer or not buyer. But classification methods aren’t limited to two classes. For
example, a classification method could help to assess whether a given image contains a car or a
truck. In this case, the output will be 3 different values: 1) the image contains a car, 2) the image
contains a truck, or 3) the image contains neither a car nor a truck.
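A minimal sketch of a two-class "buyer or not buyer" classifier using logistic regression, one common classification method; the features and labels are made up.

import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up features per visitor: [pages viewed, minutes on site].
X = np.array([[1, 2], [2, 1], [8, 15], [7, 12], [3, 4], [9, 20]])
y = np.array([0, 0, 1, 1, 0, 1])  # 0 = not buyer, 1 = buyer

clf = LogisticRegression().fit(X, y)
print(clf.predict([[6, 10]]))        # predicted class for a new visitor
print(clf.predict_proba([[6, 10]]))  # probability of each class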
Clustering:
With clustering methods, we get into the category of unsupervised ML because their goal is to
group or cluster observations that have similar characteristics. Clustering methods don’t use
output information for training, but instead let the algorithm define the output. In clustering methods, we typically rely on visualizations to inspect the quality of the solution.
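A minimal sketch using k-means, one common clustering method; note that no output labels are given to the algorithm, it groups the made-up points on its own.

import numpy as np
from sklearn.cluster import KMeans

# Made-up two-dimensional observations with no labels.
X = np.array([[1.0, 1.1], [1.2, 0.9], [0.8, 1.0],
              [5.0, 5.2], [5.1, 4.8], [4.9, 5.0]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)           # cluster assigned to each observation
print(kmeans.cluster_centers_)  # coordinates of the two cluster centres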
Dimensionality Reduction:
As the name suggests, we use dimensionality reduction to remove the least important information (sometimes redundant columns) from a data set. In practice, I often see data sets with
hundreds or even thousands of columns (also called features), so reducing the total number is
vital. For instance, images can include thousands of pixels, not all of which matter to your
analysis. Or when testing microchips within the manufacturing process, you might have
thousands of measurements and tests applied to every chip, many of which provide redundant
information. In these cases, you need dimensionality reduction algorithms to make the data set
manageable.
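A minimal sketch using PCA, one common dimensionality reduction algorithm, projecting a made-up data set with 50 columns down to two components:

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 50))  # made-up data set with 50 features

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)                # (100, 2)
print(pca.explained_variance_ratio_)  # share of variance kept by each component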
Ensemble Methods:
Ensemble methods use the idea of combining several predictive models (supervised ML) to get higher quality predictions than each of the models could provide on its own. For example, the Random Forest algorithm is an ensemble method that combines many Decision Trees trained with different samples of the data set. As a result, the quality of the predictions of a Random Forest is higher than the quality of the predictions estimated with a single Decision Tree.
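A minimal sketch comparing a single decision tree with a random forest on a synthetic data set generated purely for illustration:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic classification data, used only to compare the two models.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

tree = DecisionTreeClassifier(random_state=0)
forest = RandomForestClassifier(n_estimators=100, random_state=0)

print("single tree accuracy:", cross_val_score(tree, X, y, cv=5).mean())
print("random forest accuracy:", cross_val_score(forest, X, y, cv=5).mean())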
Neural Nets and Deep Learning:
In contrast to linear and logistic regressions, which are considered linear models, the objective of neural networks is to capture non-linear patterns in data by adding layers of parameters to the model. For example, a simple neural net might have three inputs, a single hidden layer with five parameters, and an output layer.
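A minimal numpy sketch of the forward pass of such a network, assuming the five parameters are interpreted as five hidden units and the output layer produces a single number; the weights here are random rather than trained.

import numpy as np

rng = np.random.default_rng(0)

# Random (untrained) parameters for a 3-input, 5-hidden-unit, 1-output network.
W1, b1 = rng.normal(size=(3, 5)), np.zeros(5)
W2, b2 = rng.normal(size=(5, 1)), np.zeros(1)

def forward(x):
    hidden = np.tanh(x @ W1 + b1)   # non-linear activation captures non-linear patterns
    return hidden @ W2 + b2         # single numeric output

print(forward(np.array([0.5, -1.2, 3.0])))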
In fact, the structure of neural networks is flexible enough to reproduce the well-known linear and logistic regression models. The term deep learning comes from a neural net with many hidden layers and encapsulates a wide variety of architectures.
It’s especially difficult to keep up with developments in deep learning, in part because the
research and industry communities have doubled down on their deep learning efforts, spawning
whole new methodologies every day.
For the best performance, deep learning techniques require a lot of data, and a lot of compute power, since the method self-tunes many parameters within huge architectures. It quickly
becomes clear why deep learning practitioners need very powerful computers enhanced with
GPUs (graphical processing units).
Big Data in Industry 4.0 Manufacturing:
The cornerstone of an advanced manufacturing plant is the central control system. Since time is
the critical element in its operation, a time-series database offers by far the best route to
providing this required precision. The implementation of an Industry 4.0 manufacturing plant
requires an adherence to data standards that can ensure the seamless flow of information.
In operation, the process applications backed by a time-series database deliver two critical
services: keeping the production line running efficiently and minimizing downtime. Although
these may sound like one and the same thing, they are very different in practice.
The efficiency of the production line comes down to the control and sequencing of events in the
manufacturing process. This control requires the ingestion of huge amounts of data from a
massively augmented array of sensors so that real-time instructions can be delivered to the cyber-
physical systems and other aspects of the line. This necessitates a shift from legacy backend
systems that derive from the era of independent OT/IT systems and the implementation of a
time-series database architecture to accommodate the scale and precision required.
The minimization of the production line downtime is ensured through the analytics of the data to
predict problems and equipment failures before they actually occur. Through this predictive
failure analysis, the problems can be forestalled and actions taken to eliminate the risk of an
unscheduled stoppage.
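As a rough sketch of one simple approach to this kind of predictive failure analysis, the example below flags sensor readings that drift far above their recent rolling average; the readings and the threshold are made up.

import pandas as pd

# Made-up vibration readings from one machine, sampled at a fixed interval.
readings = pd.Series([0.9, 1.0, 1.1, 1.0, 0.9, 1.0, 1.2, 1.1, 2.6, 2.8, 3.1])

# Baseline statistics over the previous five readings (current point excluded).
baseline_mean = readings.shift(1).rolling(window=5, min_periods=3).mean()
baseline_std = readings.shift(1).rolling(window=5, min_periods=3).std()

# Flag readings more than three standard deviations above the recent average.
alerts = readings > baseline_mean + 3 * baseline_std
print(readings[alerts])  # candidate early warnings of an emerging fault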
Key capabilities of such a time-series database include:
The ability to deliver precision monitoring of events, with the potential to go down to the nanosecond
Monitoring across multiple data sources
Providing context on the data so that, for example, the huge volumes of high-precision data might be kept for short periods while lower-precision data is kept for longer or indefinitely (a minimal sketch of this retention idea follows this list)
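A minimal pandas sketch of that retention idea: high-precision raw readings are downsampled to coarser per-minute averages that can be kept for longer; the timestamps and values are made up.

import pandas as pd

# Made-up high-precision sensor readings, one per second for ten minutes.
index = pd.date_range("2024-01-01 00:00:00", periods=600, freq="s")
raw = pd.Series(range(600), index=index, dtype=float)

# Keep the raw data only briefly; store minute-level averages for the long term.
minute_averages = raw.resample("1min").mean()
print(minute_averages.head())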
Another key aspect is the handling of the manufacturing data and the need for scalability and
open exchange. The data generated from a manufacturing environment can be highly variable
and unpredictable in its volume. The core time-series database needs to be able to both ingest the
high throughput of data and sustain the real-time querying. If either of these fail, then the
operational integrity of the production line could be compromised.
The open exchange of data is important to the smooth operation of any advanced manufacturing process, but it is critical in an Industry 4.0 environment.
Industry 4.0 big data comes from many diverse sources.
In 2016 PwC conducted a global survey on the state of the adoption of Industry 4.0 across a wide
range of industry sectors including aerospace, defense and security, automotive, electronics, and
industrial manufacturing. On average, the respondents expected that by 2020 Industry 4.0
implementations, including big data analytics, would reduce their production and operation costs
by 3.6%.
What follows are some selected real-life examples of how the Industry 4.0 big data vision can
bring measurable value to manufacturers:
Reduced downtime: Applicable to many industrial sectors, Industry 4.0 big data
analytics can uncover patterns that predict machine or process failures before they occur.
Machine supervisors will be able to assess process or machine performance in real time
and, in many cases, prevent unplanned downtime.