Data Analytics Writeup
Introduction:
What is data science? What is big data? What do these terms mean, and why is it important to find out? These topics are often misunderstood, and the industries involved don’t have universally agreed-upon definitions for either term.
These are extremely important fields and concepts, and they are becoming increasingly critical. The world has never collected or stored as much data, as quickly, as it does today. In addition, the variety and volume of data are growing at an alarming rate. Data is analogous to gold in many ways: it is extraordinarily valuable and has many uses, but you often have to pan for it in order to realize its value.
There are many debates as to whether data science is a new field. Many argue that similar
practices have been used and branded as statistics, analytics, business intelligence, and so forth.
In either case, data science is a very popular and prominent term used to describe many different
data-related processes and techniques. Big data, on the other hand, is relatively new in the sense that the amount of data collected, and the challenges associated with it, continue to require new and innovative hardware and techniques for handling it.
Data Science:
Data Science is a field or domain which involves working with huge amounts of data and using them to build descriptive, predictive and prescriptive analytical models. It’s about capturing and digging into the data (building the model), analysing it (validating the model) and utilizing it (deploying the best model). It is an intersection of data and computing: a blend of computer science, business and statistics.
Big Data:
Big data is the huge, voluminous data, information and relevant statistics acquired by large organizations and ventures. Much specialized software and data storage has been created and prepared because it is difficult to process big data manually.
It is used to discover patterns and trends and to make decisions related to human behaviour and interaction with technology.
Types of Data and Data Sets:
A dataset can be one of several different types, distinguished by how the data is stored and structured.
A File-Based Dataset is a complete dataset stored within a single file. File-based datasets usually have some method of assigning data to different categories.
Folder-Based Datasets are datasets in which the folder holding the data is itself the dataset. MapInfo TAB and CSV formats are examples of folder-based datasets. Here the data is stored as a series of files; commonly, each differently named file is a feature type within the dataset.
Database Datasets are sets of data stored within a database. Generally, each different
database is a different dataset. The most common example is an Oracle database. It will
be treated the same way whether it is spatial or non-spatial. Every different table within
the database is treated as a feature type.
A Web Dataset is a collection of data stored on an Internet site. In this case, the name of the dataset is the same as the URL (Uniform Resource Locator). A Web Feature Service (WFS) server is an example of this. Web datasets usually have a number of layers, and each layer signifies a different feature type.
Generally, data sets hold information in the form of records to be used by a program running on the system. Data sets are also used to store information needed by applications or the operating system itself, such as source programs, macro libraries, or system variables or parameters. Datasets can be Structured (RDBMS, Excel, text-delimited, CSV), Semi-Structured (JSON, XML) or Unstructured (text, raw images).
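As a minimal illustration of these formats, the sketch below loads a structured CSV file and a semi-structured JSON file with pandas; the file names data.csv and data.json are hypothetical placeholders.

import json
import pandas as pd

# Structured data: rows and columns with a fixed schema.
structured = pd.read_csv("data.csv")          # hypothetical file name

# Semi-structured data: nested key-value records without a rigid schema.
with open("data.json") as f:
    records = json.load(f)
semi_structured = pd.json_normalize(records)  # flatten nested fields into columns

print(structured.head())
print(semi_structured.head())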
Data mining is an area of data science in which large data sets are thoroughly processed to identify patterns and produce suitable results. Data mining is used to find patterns, anomalies and correlations in large datasets and to make predictions using a broad range of techniques; organizations use this extracted information to increase revenue, cut costs, reduce risk, improve customer relationships, and so on. Data mining technologies include neural networks, statistical analysis, decision trees, genetic algorithms, fuzzy logic, text mining, web mining and more.
Data Visualization is the process of turning complex data into visual form so that a conclusion can be drawn at a glance, without the need to study raw or theoretical results. Applications include satellite data, research results, scientifically studied data and so on. Data visualization has seven stages: acquiring, parsing, filtering, mining, representing, refining and interacting.
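As a minimal sketch of the representing stage, assuming matplotlib and using made-up monthly values:

import matplotlib.pyplot as plt

# Made-up monthly readings, used only to illustrate a simple visual summary.
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
readings = [12.1, 13.4, 15.0, 14.2, 16.8, 18.3]

plt.plot(months, readings, marker="o")
plt.title("Average monthly reading")
plt.xlabel("Month")
plt.ylabel("Reading")
plt.show()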
Types of Data Analysis:
Descriptive and Inferential Analysis: These statistical analysis methods show "What happened?" by using past data, often presented in dashboards. Statistical analysis includes the collection, analysis, interpretation, presentation and modelling of data, applied either to a complete data set or to a sample of it. There are two categories of this type of analysis: descriptive analysis and inferential analysis.
Descriptive analysis summarizes complete data, or a sample of it, numerically. It reports the mean and deviation for continuous data, and percentages and frequencies for categorical data.
Inferential analysis analyses a sample drawn from the complete data. In this type of analysis, you can reach different conclusions from the same data by selecting different samples.
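The sketch below contrasts the two ideas on a made-up data set: the descriptive step summarizes the complete data, while the inferential step estimates the mean from random samples and shows how different samples can suggest slightly different conclusions.

import numpy as np

rng = np.random.default_rng(0)
population = rng.normal(loc=50, scale=10, size=10_000)  # made-up continuous data

# Descriptive analysis: summarize the complete data.
print("population mean:", population.mean(), "std:", population.std())

# Inferential analysis: estimate the mean from samples; different samples
# can lead to slightly different conclusions.
for _ in range(3):
    sample = rng.choice(population, size=100, replace=False)
    print("sample mean:", sample.mean())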
Diagnostic Analysis: Diagnostic analysis shows "Why did it happen?" by finding the cause behind the insights uncovered in statistical analysis. This analysis is useful for identifying behaviour patterns in data. If a new problem arises in your business process, you can look to this analysis to find similar patterns from the past, and there is a good chance that similar prescriptions will apply to the new problem.
Predictive Analysis: Predictive analysis shows "What is likely to happen?" by using previous data. The simplest example: if I bought two dresses last year based on my savings, and this year my salary doubles, then I might buy four dresses. Of course it is not that simple, because other circumstances have to be considered: clothing prices may have risen this year, or instead of dresses you may want to buy a new bike, or you may need to buy a house!
So this analysis makes predictions about future outcomes based on current or past data. Forecasting is only an estimate; its accuracy depends on how much detailed information you have and how deeply you dig into it.
Prescriptive Analysis: Prescriptive analysis combines the insights from all the previous types of analysis to determine which action to take on a current problem or decision. Most data-driven companies are utilizing prescriptive analysis because predictive and descriptive analysis alone are not enough to improve performance. Based on current situations and problems, they analyse the data and make decisions.
Artificial Intelligence and Machine Learning:
According to Stanford researcher John McCarthy, artificial intelligence is "the science and engineering of making intelligent machines, especially intelligent computer programs."
The field has a long history rooted in military science and statistics, with contributions from
philosophy, psychology, math and cognitive science. Artificial intelligence originally set out to
make computers more useful and more capable of independent reasoning.
Most historians trace the birth of AI to a Dartmouth research project in 1956 that explored topics
like problem solving and symbolic methods. In the 1960s, the US Department of Defense took
interest in this type of work and increased the focus on training computers to mimic human
reasoning.
Machine learning automates analytical model building. It uses methods from neural
networks, statistics, operations research and physics to find hidden insights in data
without being explicitly programmed where to look or what to conclude.
A neural network is a kind of machine learning inspired by the workings of the human
brain. It’s a computing system made up of interconnected units (like neurons) that
processes information by responding to external inputs, relaying information between
each unit. The process requires multiple passes at the data to find connections and derive
meaning from undefined data.
Deep learning uses huge neural networks with many layers of processing units, taking
advantage of advances in computing power and improved training techniques to learn
complex patterns in large amounts of data. Common applications include image and
speech recognition.
Computer vision relies on pattern recognition and deep learning to recognize what’s in a
picture or video. When machines can process, analyze and understand images, they can
capture images or videos in real time and interpret their surroundings.
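As a rough sketch of how a deep network for image recognition might be defined, assuming TensorFlow/Keras; the layer sizes and the 28x28 grayscale input are arbitrary choices for illustration, not a recipe from the text above.

import tensorflow as tf

# A small convolutional network for classifying 28x28 grayscale images
# into 10 classes; layer sizes are illustrative only.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(28, 28, 1)),
    tf.keras.layers.Conv2D(16, 3, activation="relu"),   # learn local image patterns
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(32, 3, activation="relu"),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(10, activation="softmax"),    # class probabilities
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(train_images, train_labels, epochs=5)  # training data not shown here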
Artificial Intelligence applies machine learning, deep learning and other techniques to solve
actual problems.
A machine learning algorithm, also called a model, is a mathematical expression that represents data in the context of a problem, often a business problem. The aim is to go from data to insight. For example, if an online retailer wants to anticipate sales for the next quarter, they might use a machine learning algorithm that predicts those sales based on past sales and other relevant data. Similarly, a windmill manufacturer might visually monitor important equipment and feed the video data through algorithms trained to identify dangerous cracks. Common families of machine learning algorithms include:
1. Regression
2. Classification
3. Clustering
4. Dimensionality Reduction
5. Ensemble Methods
6. Neural Nets and Deep Learning
Regression:
Regression methods fall within the category of supervised ML. They help to predict or explain a
particular numerical value based on a set of prior data, for example predicting the price of a
property based on previous pricing data for similar properties.
The simplest method is linear regression where we use the equation of the line (y = m*x + b) to
model a data set. We train a linear regression model with many data pairs (x, y) by calculating
the position and slope of a line that minimizes the total distance between all of the data points
and the line. In other words, we calculate the slope (m) and the y-intercept (b) for a line that best
approximates the observations in the data.
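A minimal sketch of fitting y = m*x + b with scikit-learn; the quarterly figures are made up in the spirit of the retailer example above.

import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up quarterly data: x = advertising spend, y = sales.
x = np.array([[10.0], [20.0], [30.0], [40.0], [50.0]])
y = np.array([110.0, 190.0, 310.0, 420.0, 490.0])

model = LinearRegression().fit(x, y)
print("slope m:", model.coef_[0], "intercept b:", model.intercept_)
print("predicted sales for a spend of 60:", model.predict([[60.0]])[0])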
Classification:
Another class of supervised ML, classification methods predict or explain a class value. For
example, they can help predict whether or not an online customer will buy a product. The output
can be yes or no: buyer or not buyer. But classification methods aren’t limited to two classes. For
example, a classification method could help to assess whether a given image contains a car or a
truck. In this case, the output will be 3 different values: 1) the image contains a car, 2) the image
contains a truck, or 3) the image contains neither a car nor a truck.
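A minimal sketch of a two-class "buyer or not buyer" classifier using logistic regression, one common classification method; the features and labels are made up.

import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up features per visitor: [pages viewed, minutes on site].
X = np.array([[1, 2], [2, 1], [8, 15], [7, 12], [3, 4], [9, 20]])
y = np.array([0, 0, 1, 1, 0, 1])  # 0 = not buyer, 1 = buyer

clf = LogisticRegression().fit(X, y)
print(clf.predict([[6, 10]]))        # predicted class for a new visitor
print(clf.predict_proba([[6, 10]]))  # probability of each class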
Clustering:
With clustering methods, we get into the category of unsupervised ML because their goal is to
group or cluster observations that have similar characteristics. Clustering methods don’t use
output information for training, but instead let the algorithm define the output. In clustering methods, we typically rely on visualizations to inspect the quality of the solution.
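A minimal sketch using k-means, one common clustering method; note that no output labels are given to the algorithm, it groups the made-up points on its own.

import numpy as np
from sklearn.cluster import KMeans

# Made-up two-dimensional observations with no labels.
X = np.array([[1.0, 1.1], [1.2, 0.9], [0.8, 1.0],
              [5.0, 5.2], [5.1, 4.8], [4.9, 5.0]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)           # cluster assigned to each observation
print(kmeans.cluster_centers_)  # coordinates of the two cluster centres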
Dimensionality Reduction:
As the name suggests, we use dimensionality reduction to remove the least important information (sometimes redundant columns) from a data set. In practice, I often see data sets with
hundreds or even thousands of columns (also called features), so reducing the total number is
vital. For instance, images can include thousands of pixels, not all of which matter to your
analysis. Or when testing microchips within the manufacturing process, you might have
thousands of measurements and tests applied to every chip, many of which provide redundant
information. In these cases, you need dimensionality reduction algorithms to make the data set
manageable.
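A minimal sketch using PCA, one common dimensionality reduction algorithm, projecting a made-up data set with 50 columns down to two components:

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 50))  # made-up data set with 50 features

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)                # (100, 2)
print(pca.explained_variance_ratio_)  # share of variance kept by each component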
Ensemble Methods:
Ensemble methods use the idea of combining several predictive models (supervised ML) to get higher quality predictions than each of the models could provide on its own. For example, the Random Forest algorithm is an ensemble method that combines many Decision Trees trained with different samples of the data set. As a result, the quality of the predictions of a Random Forest is higher than the quality of the predictions estimated with a single Decision Tree.
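A minimal sketch comparing a single decision tree with a random forest on a synthetic data set generated purely for illustration:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic classification data, used only to compare the two models.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

tree = DecisionTreeClassifier(random_state=0)
forest = RandomForestClassifier(n_estimators=100, random_state=0)

print("single tree accuracy:", cross_val_score(tree, X, y, cv=5).mean())
print("random forest accuracy:", cross_val_score(forest, X, y, cv=5).mean())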
Neural Nets and Deep Learning:
In contrast to linear and logistic regressions, which are considered linear models, the objective of neural networks is to capture non-linear patterns in data by adding layers of parameters to the model. For example, a simple neural net might have three inputs, a single hidden layer with five parameters, and an output layer.
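A minimal numpy sketch of the forward pass of such a network, assuming the five parameters are interpreted as five hidden units and the output layer produces a single number; the weights here are random rather than trained.

import numpy as np

rng = np.random.default_rng(0)

# Random (untrained) parameters for a 3-input, 5-hidden-unit, 1-output network.
W1, b1 = rng.normal(size=(3, 5)), np.zeros(5)
W2, b2 = rng.normal(size=(5, 1)), np.zeros(1)

def forward(x):
    hidden = np.tanh(x @ W1 + b1)   # non-linear activation captures non-linear patterns
    return hidden @ W2 + b2         # single numeric output

print(forward(np.array([0.5, -1.2, 3.0])))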
In fact, the structure of neural networks is flexible enough to reproduce the well-known linear and logistic regression models. The term deep learning comes from a neural net with many hidden layers and encapsulates a wide variety of architectures.
It’s especially difficult to keep up with developments in deep learning, in part because the
research and industry communities have doubled down on their deep learning efforts, spawning
whole new methodologies every day.
For the best performance, deep learning techniques require a lot of data, and a lot of compute power, since the method self-tunes many parameters within huge architectures. It quickly
becomes clear why deep learning practitioners need very powerful computers enhanced with
GPUs (graphical processing units).
Big Data in Industry 4.0 Manufacturing:
The cornerstone of an advanced manufacturing plant is the central control system. Since time is
the critical element in its operation, a time-series database offers by far the best route to
providing this required precision. The implementation of an Industry 4.0 manufacturing plant
requires an adherence to data standards that can ensure the seamless flow of information.
In operation, the process applications backed by a time-series database deliver two critical
services: keeping the production line running efficiently and minimizing downtime. Although
these may sound like one and the same thing, they are very different in practice.
The efficiency of the production line comes down to the control and sequencing of events in the
manufacturing process. This control requires the ingestion of huge amounts of data from a
massively augmented array of sensors so that real-time instructions can be delivered to the cyber-
physical systems and other aspects of the line. This necessitates a shift from legacy backend
systems that derive from the era of independent OT/IT systems and the implementation of a
time-series database architecture to accommodate the scale and precision required.
The minimization of the production line downtime is ensured through the analytics of the data to
predict problems and equipment failures before they actually occur. Through this predictive
failure analysis, the problems can be forestalled and actions taken to eliminate the risk of an
unscheduled stoppage.
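As a rough sketch of one simple approach to this kind of predictive failure analysis, the example below flags sensor readings that drift far above their recent rolling average; the readings and the threshold are made up.

import pandas as pd

# Made-up vibration readings from one machine, sampled at a fixed interval.
readings = pd.Series([0.9, 1.0, 1.1, 1.0, 0.9, 1.0, 1.2, 1.1, 2.6, 2.8, 3.1])

# Baseline statistics over the previous five readings (current point excluded).
baseline_mean = readings.shift(1).rolling(window=5, min_periods=3).mean()
baseline_std = readings.shift(1).rolling(window=5, min_periods=3).std()

# Flag readings more than three standard deviations above the recent average.
alerts = readings > baseline_mean + 3 * baseline_std
print(readings[alerts])  # candidate early warnings of an emerging fault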
Key capabilities of such a time-series database include:
The ability to deliver precision monitoring of events, with the potential to go down to the nanosecond
Monitoring across multiple data sources
Providing context on the data so that, for example, the huge volumes of high-precision data might be kept for short periods while lower-precision data is kept for longer or indefinitely (a minimal sketch of this retention idea follows this list)
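A minimal pandas sketch of that retention idea: high-precision raw readings are downsampled to coarser per-minute averages that can be kept for longer; the timestamps and values are made up.

import pandas as pd

# Made-up high-precision sensor readings, one per second for ten minutes.
index = pd.date_range("2024-01-01 00:00:00", periods=600, freq="s")
raw = pd.Series(range(600), index=index, dtype=float)

# Keep the raw data only briefly; store minute-level averages for the long term.
minute_averages = raw.resample("1min").mean()
print(minute_averages.head())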
Another key aspect is the handling of the manufacturing data and the need for scalability and
open exchange. The data generated from a manufacturing environment can be highly variable
and unpredictable in its volume. The core time-series database needs to be able to both ingest the
high throughput of data and sustain the real-time querying. If either of these fail, then the
operational integrity of the production line could be compromised.
The open exchange of data is important to the smooth operation of any advanced manufacturing process, but it is critical in an Industry 4.0 environment.
Industry 4.0 big data comes from many diverse sources.
In 2016 PwC conducted a global survey on the state of the adoption of Industry 4.0 across a wide
range of industry sectors including aerospace, defense and security, automotive, electronics, and
industrial manufacturing. On average, the respondents expected that by 2020 Industry 4.0
implementations, including big data analytics, would reduce their production and operation costs
by 3.6%.
What follows are some selected real-life examples of how the Industry 4.0 big data vision can
bring measurable value to manufacturers:
Reduced downtime: Applicable to many industrial sectors, Industry 4.0 big data
analytics can uncover patterns that predict machine or process failures before they occur.
Machine supervisors will be able to assess process or machine performance in real time
and, in many cases, prevent unplanned downtime.