M. SC. (IT)
SEMESTER - I
REVISED SYLLABUS AS PER NEP 2020
DATA SCIENCE
© UNIVERSITY OF MUMBAI
Prof. Ravindra Kulkarni
Vice Chancellor
University of Mumbai, Mumbai.
Prin. Dr. Ajay Bhamare, Pro Vice-Chancellor, University of Mumbai.
Prof. Shivaji Sargar, Director, CDOE, University of Mumbai.
Published by
Director
Institute of Distance and Open Learning, University of Mumbai, Vidyanagari, Mumbai - 400 098.
Module 1
Unit 1 Data Science Introduction & Basics
1a. Data Science Technology Stack, Business Layer & Utility Layer 1
1b. Layered Framework 19
Unit II Statistics for Data Science
2a. Three Management Layers 26
2b. Retrieve Super Step 37
2c. Assess Superstep 53
2c1. Assess Superstep 71
Module 2
Unit 3 Data Analysis with Python & Data Visualization
3a. Process Superstep 91
3b. Transform Superstep 113
Module 3
Unit 4 Machine Learning For Data Science
4a. Transform Superstep 146
4b. Organize And Report Supersteps 162
*****
Programme Code: ________________ Programme Name: M.Sc. (Information Technology)
Course Code: 501 Course Name: Data Science
Total Credits: 04 (60 Lecture Hrs) Total Marks: 100
University assessment: 50 marks College/Department assessment: 50 marks
Pre requisite:
Basic understanding of statistics
Course Objectives (COs)
To enable the students to:
CO1 : Develop in depth understanding of the key technologies in data science and business
analytics: data mining, machine learning, visualization techniques, predictive modeling,
and statistics.
CO2 : Practice problem analysis and decision-making.
CO3 : Gain practical, hands-on experience with statistics programming languages and big data
tools through coursework and applied research experiences.
MODULE I: (2 CREDITS)
Unit 1: Data Science Introduction & Basics [15 Hrs; OC1, OC2, OC3]
a. Data Science Technology Stack: Rapid Information Factory Ecosystem, Data Science Storage Tools, Data Lake, Data Vault, Data Warehouse Bus Matrix, Data Science Processing Tools, Spark, Mesos, Akka, Cassandra, Kafka, Elastic Search, R, Scala, Python, MQTT, The Future.
b. Layered Framework: Definition of Data Science Framework, Cross-Industry Standard Process for Data Mining (CRISP-DM), Homogeneous Ontology for Recursive Uniform Schema, The Top Layers of a Layered Framework, Layered Framework for High-Level Data Science and Engineering.
c. Business Layer: Business Layer, Engineering a Practical Business Layer.
d. Utility Layer: Basic Utility Design, Engineering a Practical Utility Layer.
Unit 2: Statistics for Data Science [15 Hrs; OC4, OC5, OC6]
a. Three Management Layers: Operational Management Layer, Processing-Stream Definition and Management, Audit, Balance, and Control Layer, Balance, Control, Yoke Solution, Cause-and-Effect Analysis System, Functional Layer, Data Science Process.
b. Retrieve Superstep: Data Lakes, Data Swamps, Training the Trainer Model, Understanding the Business Dynamics of the Data Lake, Actionable Business Knowledge from Data Lakes, Engineering a Practical Retrieve Superstep, Connecting to Other Data Sources.
c. Assess Superstep: Assess Superstep, Errors, Analysis of Data, Practical Actions, Engineering a Practical Assess Superstep.
MODULE II : (2 CREDITS)
Unit 3: Data Analysis with Python & Data Visualization [15 Hrs; OC7, OC8, OC9, OC10]
a. Process Superstep: Data Vault, Time-Person-Object-Location-Event Data Vault, Data Science Process, Data Science.
b. Transform Superstep: Transform Superstep, Building a Data Warehouse, Transforming with Data Science, Hypothesis Testing, Overfitting and Underfitting, Precision-Recall, Cross-Validation Test.
Unit 4: Machine Learning for Data Science [15 Hrs; OC11, OC12, OC13, OC14]
a. Transform Superstep: Univariate Analysis, Bivariate Analysis, Multivariate Analysis, Linear Regression, Logistic Regression, Clustering Techniques, ANOVA, Principal Component Analysis (PCA), Decision Trees, Support Vector Machines, Networks, Clusters, and Grids, Data Mining, Pattern Recognition, Machine Learning, Bagging Data, Random Forests, Computer Vision (CV), Natural Language Processing (NLP), Neural Networks, TensorFlow.
b. Organize and Report Supersteps: Organize Superstep, Report Superstep, Graphics, Pictures, Showing the Difference.
Course Outcomes (OCs)
Upon completing this course, the student will be able to:
1. Apply quantitative modeling and data analysis techniques to the solution of real world
business problems, communicate findings, and effectively present results using data
visualization techniques.
2. Recognize and analyze ethical issues in business related to intellectual property, data
security, integrity, and privacy.
3. Apply ethical practices in everyday business activities and make well-reasoned ethical
business and data management decisions.
4. Demonstrate knowledge of statistical data analysis techniques utilized in business decision
making.
5. Apply principles of Data Science to the analysis of business problems.
6. Use data mining software to solve real-world problems.
7. Employ cutting edge tools and technologies to analyze Big Data.
8. Apply algorithms to build machine intelligence.
9. Demonstrate the use of teamwork, leadership skills, decision making, and organization theory.
MODULE 1
Unit 1
1a
DATA SCIENCE TECHNOLOGY STACK,
BUSINESS LAYER & UTILITY LAYER
Unit Structure
1a.1 Introduction
1a.2 Business layer
1a.3 Utility layer
1a.4 Summary
1a.5 Unit End Questions
1a.6 References
● Transform Superstep: The Transform superstep converts the data vault, via sun modeling, into a dimensional model to form a data warehouse.
● A Relational Database Management System (RDBMS) is used and designed to store data.
● To retrieve data from a relational database system, you run specific Structured Query Language (SQL) statements to perform these tasks.
● A traditional database management system works only with a schema: it operates once the schema has been described, and there is only one point of view for describing and viewing the data in the database.
● It stores dense data; all the data are written into the datastore, and schema-on-write is the widely used methodology for storing such dense data.
● Schema-on-write schemas are built for a specific purpose, which makes them difficult to change while still maintaining the data in the database.
● When there is a lot of raw data available for processing, some of the data are lost during loading, which weakens the store for future analysis.
● If some important data are not stored in the database, then you cannot process those data in further data analysis.
● Schema-on-read generates fresh, new data, increases the speed of data generation, and reduces the cycle time between data availability and actionable information.
● Both ecosystems, schema-on-read and schema-on-write, are useful and essential for data scientists and data engineers to understand data preparation, modeling, development, and deployment of data into production.
● When you apply schema-on-read to structured, unstructured, and semi-structured data, results can be slow to generate, because there is no predefined schema to support fast retrieval of data into the data warehouse.
● Schema-on-read follows an agile way of working, and it has the capability and potential to work like a NoSQL database, as it works in that environment.
● Sometimes schema-on-read throws errors at query time, because three types of data are stored in the database (structured, unstructured, and semi-structured) and there are no firm rules to guarantee fast and reliable retrieval of data compared with a structured database.
Data Lake:
● A data lake is a storage repository for large amounts of raw data: structured, semi-structured, and unstructured.
● This is the place where you can store all three types of data with no fixed limit on the amount or the storage used.
● If we compare schema-on-write with the data lake, we find that schema-on-write stores the data in a data warehouse with a predefined schema, whereas the data lake imposes far less structure on the data it stores.
● The data lake stores data with little structure because it follows the schema-on-read processing architecture.
● A data lake allows us to transform the raw data (structured, semi-structured, unstructured) into a structured format so that SQL queries can be performed for analysis.
● Most of the time, a data lake is deployed using a distributed data object store that enables schema-on-read, so that business analytics and data mining tools and algorithms can be applied to the data.
● Retrieval of data is fast because no schema is applied on write; data must be accessible without failure or unnecessary complexity.
● A data lake is similar to a real river or lake: the water comes from many different places, and eventually all the small rivers and streams merge into a big river or lake where a large amount of water is stored and can be used by anyone who needs it.
It is a low-cost and effective way to store large amounts of data in a centralized store for further organizational analysis and deployment.
Figure 1a.1
Data Vault:
● Data vault is a database modeling method designed to store long-term historical data, and the amount of history retained can be controlled by using the data vault.
● In a data vault, data comes from different sources, and it is designed in such a way that data can be loaded in parallel, so that very large implementations can be handled without failure or major redesign.
● Building the data vault is the process of transforming the schema-on-read data lake into a schema-on-write store.
● The data vault is designed to serve schema-on-read query requests against the data lake, because schema-on-read increases the speed of generating new data for better analysis and implementation.
● A data vault stores a single version of the data and does not distinguish between good data and bad data.
● Data lakes and data vaults are built using three main components (structures) of data: the hub, the link, and the satellite.
Hub:
● A hub contains a set of unique business keys with a low propensity to change, plus metadata describing the origin of each business key.
● A hub contains a surrogate key for each hub item and metadata information recording the origin of the business key.
● A hub contains a set of unique business keys that will never change over time.
● There are different types of hubs, such as the person hub, time hub, object hub, event hub, and location hub. The Time hub contains IDNumber, IDTimeNumber, ZoneBaseKey, DateTimeKey, and DateTimeValue, and it is interconnected with the other hubs through links such as Time-Person, Time-Object, Time-Event, Time-Location, and Time-Links.
● The Person hub contains IDPersonNumber, FirstName, SecondName, LastName, Gender, TimeZone, BirthDateKey, and BirthDate, and it is interconnected with the other hubs through links such as Person-Time, Person-Object, Person-Location, Person-Event, and Person-Link.
● The Object hub contains IDObjectNumber, ObjectBaseKey, ObjectNumber, and ObjectValue, and it is interconnected with the other hubs through links such as Object-Time, Object-Link, Object-Event, Object-Location, and Object-Person.
● The Event hub contains IDEventNumber, EventType, and EventDescription, and it is interconnected with the other hubs through links such as Event-Person, Event-Location, Event-Object, and Event-Time.
● The Location hub contains IDLocationNumber, ObjectBaseKey, LocationNumber, LocationName, Longitude, and Latitude, and it is interconnected with the other hubs through links such as Location-Person, Location-Time, Location-Object, and Location-Event.
Link:
● Links play a very important role in recording transactions and associations between business keys. Tables relate to each other depending on the cardinality of the relationship: one-to-one, one-to-many, many-to-one, or many-to-many.
● A link represents and connects only the elements in a business relationship; when one hub relates to another through a link, data transfers smoothly between them.
Satellites:
● The hubs and links form the structure of the model, but they carry no chronological or descriptive detail; the satellites store those characteristics, without which the model could not provide information such as the mean, median, mode, maximum, minimum, or sum of the data.
● Satellites are the structures that enable the data scientist and data engineer to store the business structure and the types of information or data attached to the hubs and links.
Figure 1a.2
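As an aside, the hub, link, and satellite structures described above can be sketched as simple tables. The following Python snippet is only an illustration under assumed sample values and hash-based surrogate keys; it is not part of the syllabus text.

# Illustrative-only sketch of hub, link, and satellite structures with pandas.
# Column names follow the hub fields listed above; the sample values and the
# hash-based surrogate keys are assumptions made for demonstration.
import hashlib
import pandas as pd

def surrogate_key(value):
    # Derive a repeatable surrogate key from a business key.
    return hashlib.md5(str(value).encode()).hexdigest()

person_hub = pd.DataFrame([{"IDPersonNumber": surrogate_key("Jane Doe"),
                            "FirstName": "Jane", "LastName": "Doe"}])
time_hub = pd.DataFrame([{"IDTimeNumber": surrogate_key("2018-01-01T00:00:00"),
                          "DateTimeValue": "2018-01-01T00:00:00"}])

# A link stores only the relationship between the two hubs' keys.
person_time_link = pd.DataFrame([{"IDPersonNumber": person_hub.IDPersonNumber[0],
                                  "IDTimeNumber": time_hub.IDTimeNumber[0]}])

# A satellite carries the descriptive characteristics attached to a hub.
person_satellite = pd.DataFrame([{"IDPersonNumber": person_hub.IDPersonNumber[0],
                                  "Gender": "Female", "BirthDate": "1990-01-23"}])

print(person_hub, time_hub, person_time_link, person_satellite, sep="\n\n")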
Data Science Processing Tools:
● Data science processing is the act of transforming the data in the data lake into a data vault and then transforming the data vault into a data warehouse.
● Most data scientists, data analysts, and data engineers use the following data science processing tools to process data and move it from the data vault into the data warehouse.
1. Spark:
● Apache Spark is an open source cluster computing framework. Open source means it is freely available on the internet: you can search for Apache Spark, download the source code, and use it as you wish.
● Apache Spark was developed at the AMPLab of the University of California, Berkeley, and the code was later donated to the Apache Software Foundation, which keeps improving it over time to make it more effective, reliable, and portable so that it runs on all platforms.
● Apache Spark provides an interface for programmers and developers to interact with the system directly, and it makes data processing parallel and accessible to data scientists and data engineers.
● Apache Spark has the capability to process all types and varieties of data against repositories including the Hadoop Distributed File System and NoSQL databases.
Companies such as IBM are hiring data scientists and data engineers with Apache Spark knowledge so that innovation on the project can continue and more features keep arriving.
● Apache Spark can process data very fast because it holds data in memory and uses an in-memory data processing engine.
It is built on top of the Hadoop Distributed File System, which makes the processing more efficient and reliable and extends it beyond Hadoop MapReduce.
Figure 1a.3
2. Spark Core:
● Spark Core is the base and foundation of the overall project; it provides distributed task dispatching, scheduling, and basic input and output functionality.
● Using Spark Core, you can run more complex queries that help you work with complex environments.
● Spark Core supports many languages: it has built-in functions and APIs in Java, Scala, and Python, which means you can write applications using Java, Python, Scala, and other supported languages.
● Spark Core also comes with advanced analytics: it is not limited to map and reduce, but has the potential and capability to support SQL queries, machine learning, and graph algorithms.
Figure 1a.4
3. Spark SQL:
● Spark SQL is a component on top of Spark Core that presents a data abstraction called DataFrames.
● Spark SQL is a fast, clustered data abstraction, allowing data manipulation to be performed for fast computation.
● It enables users to run SQL/HQL on top of Spark; using this, we can process structured, unstructured, and semi-structured data.
● Spark SQL is Apache Spark's module for working with structured and semi-structured data, and it originated to overcome the limitations of Apache Hive.
● Hive depends on the MapReduce engine of Hadoop for execution and processing of data and allows only batch-oriented operation.
● Hive lags in performance because it uses MapReduce jobs for executing ad hoc queries, and Hive does not allow you to resume a job if it fails in the middle.
● Spark performs better than Hive in many situations, for example in latency and in CPU reservation time.
● You can integrate Spark SQL and query structured and semi-structured data inside Apache Spark.
● Spark SQL follows the RDD model, and it also supports large jobs and mid-query fault tolerance.
● You can easily connect Spark SQL with JDBC and ODBC for better business connectivity.
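To make the DataFrame and SQL ideas above concrete, here is a minimal PySpark sketch. It assumes the pyspark package is installed; the sample values, view name, and query are illustrative assumptions, not part of the original text.

# Minimal PySpark sketch: build a DataFrame and query it with Spark SQL.
# The sample data, the view name, and the query are assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SparkSQLExample").getOrCreate()

df = spark.createDataFrame([("A", 5), ("B", 3), ("C", 4)],
                           ["user", "transactions"])

df.createOrReplaceTempView("transactions")   # expose the DataFrame to SQL
result = spark.sql("SELECT user FROM transactions WHERE transactions > 3")
result.show()

spark.stop()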
4. Spark Streaming:
● Apache Spark Streaming enables powerful, interactive data analytics applications over live streaming data. In streaming, data is not fixed; it arrives continuously from different sources.
● The stream divides the incoming input data into small units for further data analytics and data processing at the next level.
Figure 1a.5
5. GraphX:
GraphX is a powerful graph-processing application programming interface for the Apache Spark analytics engine.
● GraphX is a newer component of Spark for graphs and graph-parallel computation.
● GraphX supports the ETL process (Extract, Transform, and Load), exploratory analysis, and iterative graph computation within a single system.
Figure 1a.6
● GraphX gives more flexibility to work with both graphs and computations.
● Speed is one of GraphX's most important properties: it is comparable with the fastest specialized graph systems, while retaining fault tolerance and ease of use.
● It provides a growing library of graph algorithms along with more flexibility and reliability.
6. Mesos:
● Apache Mesos is an open source cluster manager that was developed at the University of California, Berkeley.
● It provides all the required resources for isolation and sharing across distributed applications.
● The Mesos software provides resource sharing in a fine-grained manner, so that utilization can be improved.
● Mesosphere Enterprise DC/OS is the enterprise version of Mesos, and it runs especially well with Kafka, Cassandra, Spark, and Akka.
Figure 1a.7
7. Akka:
● Akka is an actor-based, message-driven runtime for managing concurrency, elasticity, and resilience.
● Each actor can be controlled and limited to perform only its intended task. Akka is an open source library (toolkit).
● Akka is used to create distributed and fault-tolerant applications; the library can be integrated into the Java Virtual Machine (JVM) to support the language.
● Akka is written in Scala and can be integrated with the Scala programming language; it helps developers deal with explicit locking and thread management.
Figure 1a.8
● An actor is an entity that communicates with other actors by passing messages; each actor has its own state and behavior.
● Just as in object-oriented programming everything is an object, in Akka everything is an actor: it is an actor-based, message-driven system.
● In other words, an actor is an object that encapsulates its state and behavior.
8. Cassandra:
● Apache Cassandra is an open source distributed database system designed for storing and managing large amounts of data across commodity servers.
● Cassandra can be used both as a real-time operational data store for online transactional applications and as a read-intensive store for large-scale systems.
● Cassandra is designed to have peer-to-peer, continuously available nodes instead of master or named nodes, to ensure that there is no single point of failure.
Figure 1a.9
Figure 1a.10
9. Kafka:
● Kafka is a high-throughput messaging backbone that enables communication between data processing entities; Kafka is written in Java and Scala.
● Apache Kafka is a highly scalable, reliable, fast, and distributed system. Kafka is suitable for both offline and online message consumption.
● Kafka messages are stored on hard disk and replicated within the cluster to prevent data loss.
● Kafka is distributed, partitioned, replicated, and fault-tolerant, which makes it very reliable.
● The Kafka messaging system scales easily without downtime, which makes it very scalable. Kafka has high throughput for both publishing and subscribing to messages, and it can store terabytes of data.
● Kafka provides a unified platform for handling real-time data feeds, and it can deliver large amounts of data to diverse consumers.
● Kafka persists all data to the disk, which essentially means that all the writes go to the page cache of the OS (RAM). This makes it very efficient to transfer data from the page cache to a network socket.
Figure 1a.11
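As a small illustration of the publish/subscribe behaviour described above, the sketch below uses the third-party kafka-python package. The broker address and topic name are assumptions, and a Kafka broker must already be running for the code to work.

# Illustrative Kafka publish/subscribe sketch using the kafka-python package.
# The broker address and topic name are assumptions; a broker must be running.
from kafka import KafkaProducer, KafkaConsumer

producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("retrieve-requests", b"load customer data")   # publish a message
producer.flush()

consumer = KafkaConsumer("retrieve-requests",
                         bootstrap_servers="localhost:9092",
                         auto_offset_reset="earliest",
                         consumer_timeout_ms=5000)
for message in consumer:                                     # subscribe and read
    print(message.value)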
Different programming languages in data science processing:
1. Elastic Search:
● Elastic Search is an open source, distributed search and analytics engine designed for scalability, reliability, and easy management.
● Scalability means that it can scale at any point; reliability means that it should be trustworthy and offer stress-free management.
2. R Language:
● R is a programming language used for statistical computing and graphics.
● R is used by data engineers, data scientists, statisticians, and data miners for developing software and performing data analytics.
● There are core requirements before learning R; in particular, much of R depends on the library and package concept, so you should know what packages are and how to work with them easily.
● Related R packages include sqldf, forecast, dplyr, stringr, lubridate, ggplot2, reshape, etc.
● R has built-in capabilities to be integrated with procedural languages written in C, C++, Java, .NET, and Python.
● R has the capacity and potential for handling data and data storage.
3. Scala:
● Scala is a general-purpose programming language that supports functional programming and a strong static type system.
● Many data science projects and frameworks are built using the Scala programming language because of its capabilities and potential.
● Types and behaviors of objects are described by classes, and a class can be extended by another class, which inherits its properties.
● Scala supports higher-order functions: a function can be called by another function simply by writing the call in code.
● Once a Scala program is compiled and ready to execute, it is converted into bytecode (a machine-understandable form) that runs on the Java Virtual Machine.
● This means that Scala and Java programs can be compiled and executed on the JVM, so we can easily move from Java to Scala and vice versa.
● Scala enables you to import and use Java classes, objects, and their behavior and functions, because Scala and Java both run on the Java Virtual Machine, and you can also create your own classes and objects.
4. Python:
● Python is a programming language that can be used on a server to create web applications.
● Python can be used for web development, mathematics, and software development, and it can connect to databases to create and modify data.
● Python can handle large amounts of data and is capable of performing complex tasks on the data.
● Python is reliable, portable, and flexible and works on different platforms such as Windows, macOS, and Linux.
● Compared to other programming languages, Python is easy to learn, can perform simple as well as complex tasks, reduces the number of lines of code, and helps programmers and developers work in an easy, friendly manner.
● Python supports object-oriented and functional programming and works well with structured data.
● Python supports dynamic data types and dynamic type checking.
● Python is interpreted, and its philosophy and statements aim to reduce the number of lines of code.
SUMMARY
This chapter will help you recognize the basics of data science tools and their influence on modern data lake development. You will discover the techniques for transforming a data vault into a data warehouse bus matrix. It explains the use of Spark, Mesos, Akka, Cassandra, and Kafka to tame your data science requirements. It guides you in the use of Elasticsearch and MQTT (MQ Telemetry Transport) to enhance your data science solutions. It helps you recognize the influence of R as a creative visualization solution. It also introduces the impact and influence on the data science ecosystem of programming languages such as R, Python, and Scala.
REFERENCES
● Practical Data Science: A Guide to Building the Technology Stack for Turning Data Lakes into Business Assets, Andreas François Vermeulen, Apress, 2018.
● Principles of Data Science, Sinan Ozdemir, Packt, 2016.
● Data Science from Scratch, Joel Grus, O'Reilly, 2015.
● Data Science from Scratch: First Principles with Python, Joel Grus, Shroff Publishers, 2017.
● Experimental Design in Data Science with Least Resources, N C Das, Shroff Publishers, 2018.
*****
MODULE 1
Unit I
1b
LAYERED FRAMEWORK
● Vermeulen-Krennwallner-Hillman-Clark Group (VKHCG) is a small international company consisting of four subcompanies: 1. Vermeulen PLC, 2. Krennwallner AG, 3. Hillman Ltd, 4. Clark Ltd.
1. Vermeulen PLC:
● Vermeulen PLC is a data processing company that processes all the data within the group's companies.
● This is the company for which most of the data engineers and data scientists in this text are hired to work.
2. Krennwallner AG:
● This is an advertising and media company that prepares the advertising and media information required for customers.
● Krennwallner supplies advertising on billboards, plus advertising and content management for online delivery.
● Using the records and data available on the internet for media streams, it analyzes which media streams are watched by customers, how many times, and which content is the most watched on the internet.
● Using surveys, it specifies and chooses content for the billboards and determines how many times customers visit each channel.
3. Hillman Ltd:
● This is a logistics and supply chain company; it supplies data and logistics services around the world for business purposes.
● This includes client warehousing, international shipping, and home-to-home logistics.
4. Clark Ltd:
● This is the financial company of the group; it processes all the financial data required for financial purposes, including support money, venture capital planning, and putting money into the share market.
Scala:
● Scala is a general-purpose programming language that supports functional programming and a strong static type system.
● Many data science projects and frameworks are built using the Scala programming language because of its capabilities and potential.
Apache Spark:
● Apache Spark is an open source cluster computing framework. Open source means it is freely available on the internet: you can download the source code and use it as you wish.
Apache Mesos:
● Apache Mesos is an open source cluster manager that was developed at the University of California, Berkeley.
● It provides all the required resources for isolation and sharing across distributed applications.
Akka:
● Akka is an actor-based, message-driven runtime for managing concurrency, elasticity, and resilience.
● Each actor can be controlled and limited to perform only its intended task. Akka is an open source library (toolkit).
● Akka is used to create distributed and fault-tolerant applications; the library can be integrated into the Java Virtual Machine (JVM) to support the language.
● Akka is written in Scala and can be integrated with the Scala programming language; it helps developers deal with explicit locking and thread management.
Apache Cassandra:
● Apache Cassandra is an open source distributed database system designed for storing and managing large amounts of data across commodity servers.
● Cassandra can be used both as a real-time operational data store for online transactional applications and as a read-intensive store for large-scale systems.
Kafka:
● Kafka is a high-throughput messaging backbone that enables communication between data processing entities; Kafka is written in Java and Scala.
Python:
● Python is a programming language that can be used on a server to create web applications.
Python Libraries:
● A Python library is a collection of functions and methods that allows you to perform many actions without writing your own code.
Pandas:
● Pandas stands for "panel data", and it is the core Python library for data manipulation and data analysis.
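A tiny, illustrative example of pandas-style data manipulation follows; the column names and values are assumptions.

# Small illustration of pandas data manipulation; the data are assumptions.
import pandas as pd

sales = pd.DataFrame({"Country": ["UK", "UK", "DE"],
                      "Amount": [120, 80, 200]})
print(sales.groupby("Country")["Amount"].sum())   # aggregate per country
print(sales.describe())                           # quick summary statistics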
Matplotlib:
● Matplotlib is used for data visualization and is one of the most important packages of Python.
● Matplotlib is used to display and visualize 2D data, and it is written in Python.
● It can be used in Python scripts, Jupyter notebooks, and web application servers.
● To install the Matplotlib library on Ubuntu, you can use the following command:
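The command itself is missing from the source; a commonly used command on Ubuntu (assuming Python 3 and pip are available) is:

pip install matplotlib
# or, using the Ubuntu package manager:
sudo apt-get install python3-matplotlib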
NumPy:
● NumPy is the fundamental package for numerical computation in Python.
● NumPy is used together with the SciPy and Matplotlib packages of Python, and it is freely available on the internet.
SymPy:
● SymPy is a Python library for symbolic mathematics; it can be used to work with complex algebraic formulas.
R:
● R is a programming language used for statistical computing and graphics.
SUMMARY
This chapter will introduce you to new concepts that enable us to share insights on a common understanding and terminology. It defines the Data Science Framework in detail, while introducing the Homogeneous Ontology for Recursive Uniform Schema (HORUS). It takes you on a high-level tour of the top layers of the framework, explaining the fundamentals of the business layer, utility layer, operational management layer, plus the audit, balance, and control layers. It also discusses how to engineer a layered framework for improving the quality of data science when you are working in a large team, in parallel, with common business requirements.
UNIT END QUESTIONS
10. Explain Mesos, Akka and Cassandra as data science processing tools.
11. List and explain different programming languages using in data
science processing.
12. What is MQTT? Explain the use of MQTT in data science.
REFERENCES
● Practical Data Science: A Guide to Building the Technology Stack for Turning Data Lakes into Business Assets, Andreas François Vermeulen, Apress, 2018.
● Principles of Data Science, Sinan Ozdemir, Packt, 2016.
● Data Science from Scratch, Joel Grus, O'Reilly, 2015.
● Data Science from Scratch: First Principles with Python, Joel Grus, Shroff Publishers, 2017.
● Experimental Design in Data Science with Least Resources, N C Das, Shroff Publishers, 2018.
*****
Unit II
Statistics for Data Science
2a
THREE MANAGEMENT LAYERS
Unit Structure
2a.0 Objectives
2a.1 Introduction
2a.2 Operational Management Layer
2a.2.1 Definition and Management of Data Processing stream
2a.2.2 Eco system Parameters
2a.2.3 Overall Process Scheduling
2a.2.4 Overall Process Monitoring
2a.2.5 Overall Communication
2a.2.6 Overall Alerting
2a.3 Audit, Balance, and Control Layer
2a.4 Yoke Solution
2a.5 Functional Layer
2a.6 Data Science Process
2a.7 Unit End Questions
2a.8 References
2a.0 OBJECTIVES
● The objective is to explain in detail the core operations of the three management layers, i.e. the Operational Management Layer, the Audit, Balance, and Control Layer, and the Functional Layer.
2a.1 INTRODUCTION
● The Three Management Layers are a very important part of the
framework.
● They watch the overall operations in the data science ecosystem and
make sure that things are happening as per plan.
● If things are not going as per plan, then they have contingency actions in place for recovery or cleanup.
2a.2 OPERATIONAL MANAGEMENT LAYER
● Operations management is one of the areas inside the ecosystem
responsible for designing and controlling the process chains of a data
science environment.
● This layer is the center for complete processing capability in the data
science ecosystem.
● This layer stores what you want to process along with every
processing schedule and workflow for the entire ecosystem.
This area enables us to see an integrated view of the entire ecosystem. It reports the status of every process in the ecosystem. This is where we plan our data science processing pipelines. The layer covers:
● Definition and management of data processing streams
● Ecosystem parameters
● Overall process scheduling
● Overall process monitoring
● Overall communication
● Overall alerting
Ecosystem parameters are typically handled in one of two ways (a minimal sketch of the second follows):
1. Having a text file which we can import into every processing script.
2. A standard parameter setup script that defines a parameter database which we can import into every processing script.
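A minimal sketch of the second option is shown below. The module name, parameter names, and values are assumptions used only for illustration.

# ecosystem_parameters.py - assumed name for a standard parameter setup script
# that every processing script can import. Names and values are assumptions.
from pathlib import Path

PARAMETERS = {
    "company": "vkhcg",
    "base_dir": str(Path.home() / "vkhcg"),
    "log_level": "INFO",
}

def get_parameter(name):
    # Return a single ecosystem parameter by name.
    return PARAMETERS[name]

Any retrieve or assess script could then call get_parameter("base_dir") after importing this module, so every process shares the same settings.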
Overall Process Scheduling:
● Scheduling is done by tying or binding the remaining processes of the ecosystem to a central processing schedule.
Process Monitoring:
● The central monitoring process makes sure that there is a single
unified view of the complete system.
● We should always ensure that the monitoring of our data science is
being done from a single point.
● Without central monitoring, running different data science processes on the same ecosystem becomes a difficult task to manage.
Overall Communication:
● The Operations management handles all communication from the
system, it makes sure that any activities that are happening are
communicated to the system.
● To make sure that we have all our data science processes tracked, we may use a complex communication process.
● It is this layer that has the engine that makes sure that every processing
request is completed by the ecosystem according to the plan.
● This is the only area where you can observe which processes are currently running within your data science environment.
2a.3.1 Audit:
● An audit refers to an examination of the ecosystem that is systematic and independent.
● This sublayer records which processes are running at any given
specific point within the ecosystem.
● Data scientists and engineers use this information collected to better
understand and plan future improvements to the processing to be done.
● For logging, this sublayer uses a set of watchers (debug, information, warning, error, and fatal), described under Built-in Logging below.
Built-in Logging:
● It is always a good idea to design our logging around an organized, prespecified location; this ensures that we capture every relevant log entry in one place.
● Changing the internal or built-in logging process of the data science tools should be avoided, as this makes any future upgrades complex and will prove very costly to correct.
● A built-in logging mechanism, along with a cause-and-effect analysis system, allows you to handle more than 95% of all issues that can arise in the ecosystem.
● Since there are five logging levels, it is good practice to have five watchers for the logging locations, independent of one another, as described below:
Debug Watcher:
● The debug watcher logs the most detailed diagnostic information; it is useful to engineers while developing and troubleshooting processes.
Information Watcher:
● The information watcher logs information that is beneficial to the
running and management of a system.
● It is advised that these logs be piped to the central Audit, Balance, and
Control data store of the ecosystem.
Warning Watcher:
● Warning is usually used for exceptions that are handled or other
important log events.
● Usually this means that the issue was handled by the tool and also took
corrective action for recovery.
● It is advised that these logs be piped to the central Audit, Balance, and
Control data store of the ecosystem.
Error Watcher:
● An Error logs all unhandled exceptions in the data science tool.
● An Error is a state of the system. This state is not good for the overall
processing, since it normally means that a specific step did not
complete as expected.
● In case of an error the ecosystem should handle the issue and take the
necessary corrective action for recovery.
● It is advised that these logs be piped to the central Audit, Balance, and
Control data store of the ecosystem.
Fatal Watcher:
● Fatal is a state reserved for special exceptions or conditions for which
it is mandatory that the event causing this state be identified
immediately.
● This state is not good for the overall processing, since it normally
means that a specific step did not complete as expected.
● In case of a fatal error, the ecosystem should handle the issue and take
the necessary corrective action for recovery.
● It is advised that these logs be piped to the central Audit, Balance, and
Control data store of the ecosystem.
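The five watchers correspond closely to the standard log levels of Python's built-in logging module; the sketch below only illustrates that mapping, and the log file name is an assumption.

# Illustrative mapping of the five watchers onto Python's standard log levels.
# The log file location is an assumption.
import logging

logging.basicConfig(filename="central_audit.log",
                    level=logging.DEBUG,
                    format="%(asctime)s %(levelname)s %(message)s")

logging.debug("Debug watcher: detailed diagnostic information")
logging.info("Information watcher: normal running of the system")
logging.warning("Warning watcher: handled exception, corrective action taken")
logging.error("Error watcher: unhandled exception, step did not complete")
logging.critical("Fatal watcher: condition that must be investigated immediately")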
● Basic Logging: Every time a process is executed, this logging allows you to log everything that occurs to a central file.
Process Tracking:
● For Process Tracking it is advised to create a tool that will perform a
controlled, systematic and independent examination of the process for
the hardware logging.
● There may be numerous server-based software that monitors system
hardware related parameters like voltage, fan speeds, temperature
sensors and clock speeds of a computer system.
● It is advised to use the tool with which your customer and you are both most comfortable working.
Data Provenance:
● For every data entity, all the transformations in the system should be tracked, so that a record can be generated of the activity.
● This ensures two things: 1. that we can reproduce the data, if required,
in the future and 2. That we can supply a detailed history of the data’s
source in the system throughout its transformation.
Data Lineage:
● This involves keeping records of every change whenever it happens to
every individual data value in the data lake.
● This helps us to figure out the exact value of any data item in the past.
● This is normally accomplished by enforcing a valid-from and valid-to
audit entry for every data item in the data lake.
2a.3.2 Balance:
● The balance sublayer has the responsibility of making sure that the data science environment is balanced between the available processing capability and the required processing capability, or has the ability to upgrade processing capability during periods of extreme processing.
2a.3.3 Control:
● The execution of the current active data science processes is controlled
by the control sublayer.
● Its control elements build on the control features available in the tools of the Data Science Technology Stack.
2a.4.1 Producer:
● The producer is the part of the system that generates the requests for data science processing, by creating structured messages for each type of data science process it requires.
● The producer is the end point of the pipeline that loads messages into
Kafka.
2a.4.2 Consumer:
● The consumer is the part of the process that takes in messages and
organizes them for processing by the data science tools.
● The consumer is the end point of the pipeline that offloads the
messages from Kafka.
● You can use the Python NetworkX library to resolve any conflicts, by formulating the processing graph up to a specific point before or after you send or receive messages via Kafka, as sketched below.
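A small sketch of that NetworkX idea follows: the processing steps are formulated as a directed graph and checked for conflicts before requests are sent or received via Kafka. The step names are assumptions based on the six supersteps.

# Illustrative use of NetworkX to formulate the processing pipeline as a
# directed graph before messages are produced or consumed via Kafka.
# The step names are assumptions.
import networkx as nx

pipeline = nx.DiGraph()
pipeline.add_edges_from([("Retrieve", "Assess"),
                         ("Assess", "Process"),
                         ("Process", "Transform"),
                         ("Transform", "Organize"),
                         ("Organize", "Report")])

# A valid schedule must be free of cycles (conflicts).
print(nx.is_directed_acyclic_graph(pipeline))   # True
print(list(nx.topological_sort(pipeline)))      # a safe execution order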
2a.5 FUNCTIONAL LAYER
● The data structures in the functional layer of the ecosystem are:
• Data schemas and data formats: Functional data schemas and data
formats deploy onto the data lake’s raw data, to perform the required
schema-on-query via the functional layer.
• Data models: These form the basis for future processing to enhance
the processing capabilities of the data lake, by storing already
processed data sources for future use by other processes against the
data lake.
• Processing algorithms: The functional processing is performed via a
series of well-designed algorithms across the processing chain.
• Provisioning of infrastructure: The functional infrastructure
provision enables the framework to add processing capability to the
ecosystem, using technology such as Apache Mesos, which enables
the dynamic provisioning of processing work cells.
● The processing algorithms and data models are spread across six
supersteps for processing the data lake.
1. Retrieve: This super step contains all the processing chains for
retrieving data from the raw data lake into a more structured format.
2. Assess: This super step contains all the processing chains for quality assurance and additional data enhancements.
3. Process: This super step contains all the processing chains for
building the data vault.
4. Transform: This super step contains all the processing chains for
building the data warehouse from the core data vault.
5. Organize: This super step contains all the processing chains for
building the data marts from the core data warehouse.
6. Report: This super step contains all the processing chains for
building virtualization and reporting of the actionable knowledge.
REFERENCES
Andreas François Vermeulen, “Practical Data Science - A Guide to
Building the Technology Stack for Turning Data Lakes into Business
Assets”
*****
Unit II
2b
RETRIEVE SUPER STEP
Unit Structure
2b.0 Objectives
2b.1 Introduction
2b.2 Data Lakes
2b.3 Data Swamps
2b.3.1 Start with Concrete Business Questions
2b.3.2 Data Quality
2b.3.4 Audit and Version Management
2b.3.5 Data Governance
2b.3.5.1. Data Source Catalog
2b.3.5.2. Business Glossary
2b.3.5.3. Analytical Model Usage
2b.4 Training the Trainer Model
2b.5 Shipping Terminologies
2b.5.1 Shipping Terms
2b.5.2 Incoterm 2010
2b.6 Other Data Sources /Stores
2b.7 Review Questions
2b.8 References
2b.0 OBJECTIVES
● The objective of this chapter is to explain in detail the core operations
in the Retrieve Super step.
● This chapter explains important guidelines which if followed will
prevent the data lake turning into a data swamp.
2b.1 INTRODUCTION
● The Retrieve super step is a practical method for importing a data lake
consisting of different external data sources completely into the
processing ecosystem.
● The Retrieve super step is the first contact between your data science
and the source systems.
● The successful retrieval of the data is a major stepping-stone to
ensuring that you are performing good data science.
● Data lineage delivers the audit trail of the data elements at the lowest
granular level, to ensure full data governance.
● Data governance supports metadata management for system
guidelines, processing strategies, policies formulation, and
implementation of processing.
● Data quality and master data management helps to enrich the data
lineage with more business values, if you provide complete data
source metadata.
● The Retrieve super step supports the edge of the ecosystem, where
your data science makes direct contact with the outside data world. I
will recommend a current set of data structures that you can use to
handle the deluge of data you will need to process to uncover critical
business knowledge.
● Just as a lake needs rivers and streams to feed it, the data lake will
consume an unavoidable deluge of data sources from upstream and
deliver it to downstream partners
2b.3 DATA SWAMPS
To prevent the data lake from turning into a data swamp, four critical steps should be followed:
1. Start with Concrete Business Questions
2. Data Quality
3. Audit and Version Management
4. Data Governance
● Data processing should include the following rules:
● Expected frequency:
• Irregular, i.e., no fixed frequency, also known as ad hoc; or every minute, hourly, daily, weekly, monthly, or yearly.
• Other options are near-real-time: every 5 seconds, every minute, hourly, daily, weekly, monthly, or yearly.
● Unique data mapping number: Use the format NNNNNNN/NNNNNNNNN.
● External data source field name: States the field as found in the raw data source.
● External data source field type: Records the full set of the field's data types when loading the data lake.
● Internal data source field name: Records every internal data field name to use once loaded from the data lake.
● Internal data source field type: Records the full set of the field's types to use internally once loaded.
● Data Field Name Verification
• This is used to validate and verify the data field names in the retrieve processing in an easy manner.
• Example:
library(tibble)
set_tidy_names(INPUT_DATA, syntactic = TRUE, quiet = FALSE)
INPUT_DATA_with_ID = rowid_to_column(INPUT_DATA_FIX, var = "Row_ID")
sapply(INPUT_DATA_with_ID, typeof)
library(data.table)
country_histogram = data.table(Country = unique(INPUT_DATA_with_ID[is.na(INPUT_DATA_with_ID['Country']) == 0, ]$Country))
● Minimum Value
• Determine the minimum value in a specific column.
• Example: find minimum value
min(country_histogram$Country)
or
sapply(country_histogram[,'Country'], min, na.rm=TRUE)
● Maximum Value
• Determine the maximum value in a specific column.
• Example: find maximum value
max(country_histogram$Country)
or
sapply(country_histogram[,'Country'], max, na.rm=TRUE)
● Mean
• If the column is numeric in nature, determine the average value in a
specific column.
• Example: find mean of latitude
sapply(lattitue_histogram_with_id[,'Latitude'], mean, na.rm=TRUE)
● Median
• Determine the value that splits the data set into two parts in a
specific column.
• Example: find median of latitude
sapply(lattitue_histogram_with_id[,'Latitude'], median, na.rm=TRUE)
● Mode
• Determine the value that appears most in a specific column.
• Example: Find mode for column country
INPUT_DATA_COUNTRY_FREQ = data.frame(with(INPUT_DATA_with_ID, table(Country)))
● Range
• For numeric values, you determine the range of the values by taking
the maximum value and subtracting the minimum value.
• Example: find range of latitude
sapply(lattitue_histogram_with_id[,'Latitude'], range, na.rm=TRUE)
● Quartiles
• These are the base values that divide a data set in quarters. This is
done by sorting the data column first and then splitting it in groups
of four equal parts.
• Example: find quartile of latitude
sapply(lattitue_histogram_with_id[,'Latitude'], quantile, na.rm=TRUE)
● Standard Deviation
• The standard deviation is a measure of the amount of variation or
dispersion of a set of values.
• Example: find standard deviation of latitude
sapply(lattitue_histogram_with_id[,'Latitude'], sd, na.rm=TRUE)
● Skewness
• Skewness describes the shape or profile of the distribution of the
data in the column.
• Example: find skewness of latitude
library(e1071)
skewness(lattitue_histogram_with_id$Latitude, na.rm = FALSE, type = 2)
missing_country = data.table(Country = unique(INPUT_DATA_with_ID[is.na(INPUT_DATA_with_ID['Country']) == 1, ]))
● Data Pattern
• I have used the following process for years to determine a pattern for the data values themselves.
• Here is my standard version:
• Replace all alphabetic characters with an uppercase A, all numbers with an uppercase N, any spaces with a lowercase b, and all other unknown characters with a lowercase u.
• As a result, "Data Science 102" becomes "AAAAbAAAAAAAbNNN" (a trailing period or other unknown character would add a lowercase u). This pattern creation is beneficial for designing any specific assess rules.
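A short Python sketch of that pattern rule follows; the function name is an assumption.

# Sketch of the pattern rule described above; the function name is assumed.
def value_pattern(value):
    pattern = ""
    for ch in str(value):
        if ch.isalpha():
            pattern += "A"    # any letter becomes an uppercase A
        elif ch.isdigit():
            pattern += "N"    # any number becomes an uppercase N
        elif ch == " ":
            pattern += "b"    # any space becomes a lowercase b
        else:
            pattern += "u"    # anything else becomes a lowercase u
    return pattern

print(value_pattern("Data Science 102"))   # AAAAbAAAAAAAbNNN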
2b.4 TRAINING THE TRAINER MODEL
● To prevent a data swamp, it is essential that you train your team also.
Data science is a team effort.
● People, process, and technology are the three cornerstones to ensure
that data is curated and protected.
● You are responsible for your people; share the knowledge you acquire
from this book. The process I teach you, you need to teach them.
Alone, you cannot achieve success.
● Technology requires that you invest time to understand it fully. We are
only at the dawn of major developments in the field of data
engineering and data science.
● Remember: A big part of this process is to ensure that business users
and data scientists understand the need to start small, have concrete
questions in mind, and realize that there is work to do with all data to
achieve success.
2b.5 SHIPPING TERMINOLOGIES
Shipping terms define the responsibilities of sellers and buyers as goods move between locations; the data science versions below use them as an analogy for responsibility over data as it moves between systems.
● EXW—Ex Works
• Here the seller will make the product or goods available at his
premises or at another named place. This term EXW puts the
minimum obligations on the seller of the product /item and
maximum obligation on the buyer.
• Here is the data science version: If I were to buy an item at a local store and take it home, and the shop has shipped it EXW—Ex Works, the moment I pay at the register, the ownership is transferred to me. If anything happens to the item, I would have to pay to replace it.
● FCA—Free Carrier
• In this condition, the seller is expected to deliver the product or
goods, that are cleared for export, at a named place.
• The data science version: If I were to buy an item at an overseas duty-free shop and then pick it up at the duty-free desk before taking it home, and the shop has shipped it FCA—Free Carrier—to the duty-free desk, the moment I pay at the register, the ownership is transferred to me, but if anything happens to the item between the shop and the duty-free desk, the shop will have to pay.
• It is only once I pick it up at the desk that I will have to pay if anything happens. So, the moment I take the item, the transaction becomes EXW, and I have to pay any necessary import duties on arrival in my home country.
● CPT—Carriage Paid To
• Under this term, the seller is expected to pay for the carriage of
product or goods up to the named place of destination.
• The moment the product or goods are delivered to the first carrier, the risk transfers from the seller to the buyer.
● DAT—Delivered at a Terminal:
• According to this term the seller has to deliver and unload the
goods at a named terminal. The seller assumes all risks till the
delivery at the destination and has to pay all incurred costs of
transport including export fees, carriage, unloading from the main
carrier at destination port, and destination port charges.
• The terminal can be a port, airport, or inland freight interchange,
but it must be a facility with the capability to receive the shipment.
If the seller is not able to organize unloading, it should consider
shipping under DAP terms instead. All charges after unloading (for
example, import duty, taxes, customs and on-carriage costs) are to
be borne by buyer.
• The data science version. If I were to buy an item at an overseas
store and then pick it up at a local store before taking it home, and
the overseas shop shipped it—Delivered at Terminal (Local
Shop)—the moment I pay at the register, the ownership is
transferred to me.
• However, if anything happens to the item between the payment and the pickup, the local shop pays. It is only once I pick it up at the local shop that I have to pay if anything happens. So, the moment I take it, the transaction becomes EXW, and I have to pay any import duties on arrival in my home country.
● DAP—Delivered at Place:
• Under this option, the seller delivers the goods at a given place of destination. Here, the risk will pass from the seller to the buyer at the destination point.
• The packaging cost at the origin has to be paid by the seller; also, all the legal formalities in the exporting country will be carried out by the seller at his own expense.
• Once the goods are delivered in the destination country the buyer
has to pay for the customs clearance.
• Here is the data science version. If I were to buy 100 pieces of a
particular item from an overseas web site and then pick up the
copies at a local store before taking them home, and the shop
shipped the copies DAP-Delivered At Place (Local Shop)— the
moment I paid at the register, the ownership would be transferred
to me. However, if anything happened to the item between the
payment and the pickup, the web site owner pays. Once the 100
pieces are picked up at the local shop, I have to pay to unpack
them at the store. So, the moment I take the copies, the transaction
becomes EXW, so I will have to pay costs after I take the copies.
2b.6 OTHER DATA SOURCES/STORES
● SQLite
• This requires a package named sqlite3.
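A minimal sketch with the standard sqlite3 module follows; the database file, table name, and sample values are assumptions.

# Minimal sqlite3 sketch; the database file, table name, and values are assumptions.
import sqlite3

conn = sqlite3.connect("datalake.db")
conn.execute("CREATE TABLE IF NOT EXISTS retrieve_log (source TEXT, rows INTEGER)")
conn.execute("INSERT INTO retrieve_log VALUES (?, ?)", ("customer_data.csv", 100))
conn.commit()
print(conn.execute("SELECT * FROM retrieve_log").fetchall())
conn.close()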
● Oracle
• Oracle is a common database storage option in bigger companies.
It enables you to load data from the following data source with
ease:
from sqlalchemy import create_engine
engine = create_engine('oracle://andre:vermeulen@<hostname>:1521/vermeulen')  # host placeholder; the original value is not recoverable
● MySQL
• MySQL is widely used by lots of companies for storing data. This
opens that data to your data science with the change of a simple
connection string.
• There are two options. For a direct connection to the database, use the form shown below.
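The connection string itself is cut off in the source. A typical SQLAlchemy form, with placeholder credentials, host, and database name (and pymysql as one common driver choice), would be:

# Assumed SQLAlchemy connection string for MySQL; the credentials, host, and
# database name are placeholders that must be replaced before use.
from sqlalchemy import create_engine
engine = create_engine('mysql+pymysql://<user>:<password>@<hostname>:3306/<database>')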
● Apache Cassandra
• Cassandra is becoming a widely used distributed database engine in the corporate world.
• To access it, use the Python package cassandra.
● Apache Hadoop
• Hadoop is one of the most successful data lake ecosystems in
highly distributed data Science.
• The pydoop package includes a Python MapReduce and HDFS
API for Hadoop.
● Pydoop
• It is a Python interface to Hadoop that allows you to write
MapReduce applications and interact with HDFS in pure Python
● Microsoft Excel
• Excel is common in the data sharing ecosystem, and it enables you
to load files using this format with ease.
● Apache Spark
• Apache Spark is now becoming the next standard for distributed
data processing. The universal acceptance and support of the
processing ecosystem is starting to turn mastery of this technology
into a must-have skill.
● Apache Hive
• Access to Hive opens its highly distributed ecosystem for use by
data scientists.
● Luigi
• Luigi enables a series of Python features that enable you to build
complex pipelines into batch jobs. It handles dependency
resolution and workflow management as part of the package.
• This will save you from performing complex programming while
enabling good quality processing
● Amazon S3 Storage
• S3, or Amazon Simple Storage Service (Amazon S3), creates
simple and practical methods to collect, store, and analyze data,
irrespective of format, completely at massive scale. I store most of my data in S3.
● Amazon Redshift
• Amazon Redshift is a cloud service that is a fully managed, petabyte-scale data warehouse.
• The Python package redshift-sqlalchemy is an Amazon Redshift dialect for SQLAlchemy that opens this data source to your data science.
2b.8 REFERENCES
Books:
● Andreas François Vermeulen, “Practical Data Science - A Guide to
Building the Technology Stack for Turning Data Lakes into Business
Assets”
*****
2C
ASSESS SUPERSTEP
Unit Structure
2c.0 Objectives
2c.1 Assess Superstep
2c.2 Errors
2c.2.1 Accept the Error
2c.2.2 Reject the Error
2c.2.3 Correct the Error
2c.2.4 Create a Default Value
2c.3 Analysis of Data
2c.3.1 Completeness
2c.3.2 Consistency
2c.3.3 Timeliness
2c.3.4 Conformity
2c.3.5 Accuracy
2c.3.6 Integrity
2c.4 Practical Actions
2c.4.1 Missing Values in Pandas
2c.4.1.1 Drop the Columns Where All Elements Are Missing Values
2c.4.1.2 Drop the Columns Where Any of the Elements Is Missing
Values
2c.4.1.3 Keep Only the Rows That Contain a Maximum of Two
Missing Values
2c.4.1.4 Fill All Missing Values with the Mean, Median, Mode,
Minimum, and Maximum of the Particular Numeric
Column
2c.5 Let us Sum up
2c.6 Unit End Questions
2c.7 List of References
2C.0 OBJECTIVES
This chapter makes you understand the following concepts:
● Principles of data analysis
2C.2 ERRORS
Errors are the norm, not the exception, when working with data. By now,
you’ve probably heard the statistic that 88% of spreadsheets contain
errors. Since we cannot safely assume that any of the data we work with is
error-free, our mission should be to find and tackle errors in the most
efficient way possible.
User Device OS Transactions
A Mobile Android 5
B Mobile Window 3
C Tablet NA 4
D NA Android 1
E Mobile IOS 2
Table 2c.1
In the above case, the entire observations for User C and User D will be ignored under listwise deletion.
b. Pairwise: In this case, only the missing observations are ignored and the analysis is performed on the data that is available. In the above case, two separate samples will be analyzed: one with the combination of User, Device, and Transactions, and the other with the combination of User, OS, and Transactions. In such a case, no observation is deleted; each sample simply ignores the variable that has a missing value in it.
Both of the above methods suffer from loss of information. Listwise deletion suffers the maximum information loss compared to pairwise deletion; but the problem with pairwise deletion is that, even though it uses the available cases, you cannot compare the analyses because the sample is different every time.
Use the reject-the-error option if you can afford to lose a bit of data. This is an option to be used only if the number of missing values is 2% of the whole dataset or less.
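A minimal pandas illustration of the two approaches on the data of Table 2c.1 follows; the column names mirror the table, and the code is only a sketch.

# Listwise versus pairwise handling of the Table 2c.1 data using pandas.
import pandas as pd

df = pd.DataFrame({"User": ["A", "B", "C", "D", "E"],
                   "Device": ["Mobile", "Mobile", "Tablet", None, "Mobile"],
                   "OS": ["Android", "Window", None, "Android", "IOS"],
                   "Transactions": [5, 3, 4, 1, 2]})

listwise = df.dropna()                  # drops the rows for users C and D
print(listwise)

# Pairwise-style analysis keeps each combination of columns separately:
print(df[["User", "Device", "Transactions"]].dropna())
print(df[["User", "OS", "Transactions"]].dropna())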
Table 2c.2
Can you point out a few inconsistencies? Write down a few and check your answers below!
1. First, there are empty cells for the "country" and "date of birth
variables". We call these missing attributes.
2. If you look at the "Country" column, you see a cell that contains 24.
“24” is definitely not a country! This is known as a lexical error.
3. Next, you may notice in the "Height" column that there is an entry
with a different unit of measure. Indeed, Rodney's height is recorded in
feet and inches while the rest are recorded in meters. This is an
irregularity error, because the units of measure are not uniform.
4. Mark has two email addresses. This is not necessarily a problem, but if
you forget about this and code an analysis program based on the
assumption that each person has only one email address, your program
will probably crash! This is called a formatting error.
5. Look at the "date of birth" variable. There is also a formatting error
here as Rob’s date of birth is not recorded in the same format as the
others.
6. Samuel appears on two different rows. But, how can we be sure this is
the same Samuel? By his email address, of course! This is called a
duplication error. But look closer, Samuel’s two rows each give a
different value for the "height variable": 1.67m and 1.45m. This is
called a contradiction error.
7. Honey is apparently 9'1". This height diverges greatly from the normal
heights of human beings. This value is, therefore, referred to as an outlier.
The term outlier can indicate two different things: an atypical value and
an aberration.
Figure 2c.1
One of the causes of data quality issues is source data that is housed in a
patchwork of operational systems and enterprise applications. Each of
these data sources can have scattered or misplaced values, outdated and
duplicate records, and inconsistent (or undefined) data standards and
formats across customers, products, transactions, financials and more.
Data quality problems can also arise when an enterprise consolidates data
during a merger or acquisition. But perhaps the largest contributor to data
quality issues is that the data are being entered, edited, maintained,
manipulated and reported on by people.
To maintain the accuracy and value of the business-critical operational
information that impacts strategic decision-making, businesses should
implement a data quality strategy that embeds data quality techniques into
their business processes and into their enterprise applications and data
integration.
2c.3.1 Completeness:
Completeness is defined as expected comprehensiveness. Data can be
complete even if optional data is missing. As long as the data meets the
expectations then the data is considered complete.
For example, a customer’s first name and last name are mandatory but
middle name is optional; so a record can be considered complete even if a
middle name is not available.
Questions you can ask yourself: Is all the requisite information available?
Do any data values have missing elements? Or are they in an unusable
state?
2c.3.2 Consistency:
Consistency means data across all systems reflects the same information
and is in sync across the enterprise.
Examples:
● A business unit status is closed but there are sales for that business
unit.
2c.3.3 Timeliness:
Timeliness refers to whether information is available when it is expected
and needed. Timeliness of data is very important. This is reflected in:
● Companies that are required to publish their quarterly results within a
given frame of time
● Customer service providing up-to-date information to the customers
2c.3.4 Conformity:
Conformity means the data follows a set of standard data definitions, such as
data type, size, and format. For example, a customer's date of birth is in the
format “mm/dd/yyyy”.
Questions you can ask yourself: Do data values comply with the specified
formats? If so, do all the data values comply with those formats?
Maintaining conformance to specific formats is important.
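As an illustration only (not from the source), a minimal pandas sketch of a conformity check; the sample dates are hypothetical and the expected “mm/dd/yyyy” format follows the example above.

import pandas as pd

# Hypothetical dates of birth that should conform to mm/dd/yyyy.
dob = pd.Series(["01/22/1990", "1990-01-22", "12/05/1985", "31/12/1988"])

# errors="coerce" turns values that do not match the expected format into NaT,
# so non-conforming records can be counted or flagged for correction.
parsed = pd.to_datetime(dob, format="%m/%d/%Y", errors="coerce")
non_conforming = dob[parsed.isna()]
print(non_conforming)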
2c.3.5 Accuracy:
Accuracy is the degree to which data correctly reflects the real-world object
or event being described. Examples:
2c.3.6 Integrity:
Integrity means validity of data across the relationships and ensures that
all data in a database can be traced and connected to other data.
For example, in a customer database, there should be valid customers,
addresses, and relationships between them. If there is address relationship
data without a customer, then that data is not valid and is considered an
orphaned record.
Ask yourself: Are there any data missing important relationship
linkages? The inability to link related records together may actually
introduce duplication across your systems.
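A minimal sketch (not from the source) of how orphaned records can be exposed with pandas; the customer and address tables and their column names are hypothetical.

import pandas as pd

customers = pd.DataFrame({"customer_id": [1, 2, 3]})
addresses = pd.DataFrame({"address_id": [10, 11, 12, 13],
                          "customer_id": [1, 2, 2, 9]})  # 9 has no matching customer

# A left merge with an indicator column exposes orphaned address records.
check = addresses.merge(customers, on="customer_id", how="left", indicator=True)
orphans = check[check["_merge"] == "left_only"]
print(orphans)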
2c.4.1 Missing Values in Pandas:
Following are four basic processing concepts.
1. Drop the Columns Where All Elements Are Missing Values
2. Drop the Columns Where Any of the Elements Is Missing Values
3. Keep Only the Rows That Contain a Maximum of Two Missing
Values
4. Fill All Missing Values with the Mean, Median, Mode, Minimum
2c.4.1.1. Drop the Columns Where All Elements Are Missing Values
Importing data:
Step 1: Importing necessary libraries:
import os
import pandas as pd
Step 2: Changing the working directory:
os.chdir(r"D:\Pandas")
Pandas provides various data structures and operations for manipulating
numerical data and time series. However, there can be cases where some
data might be missing. In Pandas missing data is represented by two
values:
● None: None is a Python singleton object that is often used for missing
data in Python code.
● NaN: NaN (an acronym for Not a Number), is a special floating-point
value recognized by all systems that use the standard IEEE floating-
point representation
Pandas treats None and NaN as essentially interchangeable for indicating
missing or null values. In order to drop null values from a dataframe, we use
the dropna() function; this function drops rows/columns of a dataset with
null values in different ways.
Syntax:
DataFrame.dropna(axis=0, how='any', thresh=None, subset=None,
inplace=False)
Parameters:
● axis: axis takes int or string value for rows/columns. Input can be 0 or
1 for Integer and ‘index’ or ‘columns’ for String.
● how: how takes a string value of two kinds only (‘any’ or ‘all’). ‘any’
drops the row/column if ANY value is Null and ‘all’ drops only if
ALL values are null.
A B C D
0 NaN 2.0 NaN 0
1 3.0 4.0 NaN 1
2 NaN NaN NaN 5
Table 2c.3
Here, column C has all NaN values. Let’s drop this column using the
following code.
Code:
import pandas as pd
import numpy as np
df = pd.DataFrame([[np.nan, 2, np.nan, 0], [3, 4, np.nan, 1],
                   [np.nan, np.nan, np.nan, 5]],
                  columns=list('ABCD'))
df  # it will print the data frame
df.dropna(axis=1, how='all')  # this deletes the columns in which all values are null
Here, axis=1 means columns and how=’all’ means drop the columns with
all NaN values.
A B D
0 NaN 2.0 0
1 3.0 4.0 1
2 NaN NaN 5
Table 2c.4
2c.4.1.2. Drop the Columns Where Any of the Elements Is Missing Values:
Let’s consider the same dataframe again:
A B C D
0 NaN 2.0 NaN 0
1 3.0 4.0 NaN 1
2 NaN NaN NaN 5
Table 2c.5
Here, columns A, B, and C each contain at least one NaN value. Let’s drop
these columns using the following code.
Code:
import pandas as pd
import numpy as np
df = pd.DataFrame([[np.nan, 2, np.nan, 0], [3, 4, np.nan, 1],
                   [np.nan, np.nan, np.nan, 5]],
                  columns=list('ABCD'))
df  # it will print the data frame
df.dropna(axis=1, how='any')  # this deletes the columns that contain any null value
Here, axis=1 means columns, and how=’any’ means drop the columns with
one or more NaN values.
D
0 0
1 1
2 5
Table 2c.6
2c.4.1.3. Keep Only the Rows That Contain a Maximum of Two Missing
Values:
Let’s consider the same dataframe again:
A B C D
0 NaN 2.0 NaN 0
1 3.0 4.0 NaN 1
2 NaN NaN NaN 5
Table 2c.7
Here, row 2 has more than two NaN values, so this row will get dropped.
Use the following code.
Code:
# importing pandas as pd
import pandas as pd
import numpy as np
df = pd.DataFrame([[np.nan, 2, np.nan, 0], [3, 4, np.nan, 1],
                   [np.nan, np.nan, np.nan, 5]],
                  columns=list('ABCD'))
df
df.dropna(thresh=2)
# this drops the rows that have fewer than two non-null values
Here, thresh=2 means at least two non-NaN values are required per row; for
this four-column frame, at most two NaN values are allowed per row.
A B C D
0 NaN 2.0 NaN 0
1 3.0 4.0 NaN 1
Table 2c.8
2c.4.1.4. Fill All Missing Values with the Mean, Median, Mode,
Minimum
Another approach to handling missing values is to impute or estimate
them. Missing value imputation has a long history in statistics and has
been thoroughly researched. In essence, imputation uses information and
relationships among the non-missing predictors to provide an estimate to
fill in the missing value. The goal of these techniques is to ensure that the
statistical distributions are tractable and of good enough quality to support
subsequent hypothesis testing. The primary approach in this scenario is to
use multiple imputations; several variations of the data set are created with
different estimates of the missing values. The variations of the data sets
are then used as inputs to models and the test statistic replicates are
computed for each imputed data set. From these replicate statistics,
appropriate hypothesis tests can be constructed and used for decision
making.
A simple guess of a missing value is the mean, median, or mode (the most
frequently occurring value) of that variable.
df
df.fillna(df.mean())
Output:
Apple Orange Banana Pear
Basket 1 10 NaN 30 40
Basket 2 7 14 8 28
Basket 3 55 NaN 8 12
Basket 4 15 14 NaN 12
Basket 5 7 1 1 NaN
Basket 6 NaN 4 9 2
Table 2c.13
Here, we can see NaN values in all the columns. Let’s fill them with the
mode of each column. For this, use the following code:
import pandas as pd
import numpy as np
df = pd.DataFrame([[10, np.nan, 30, 40], [7, 14, 8, 28], [55, np.nan, 8, 12],
                   [15, 14, np.nan, 12], [7, 1, 1, np.nan], [np.nan, 4, 9, 2]],
                  columns=['Apple', 'Orange', 'Banana', 'Pear'],
                  index=['Basket1', 'Basket2', 'Basket3', 'Basket4',
                         'Basket5', 'Basket6'])
df
for column in df.columns:
    df[column].fillna(df[column].mode()[0], inplace=True)
df
Output:
Apple Orange Banana Pear
Basket 1 10.0 14.0 30.0 40.0
Basket 2 7.0 14.0 8.0 28.0
Basket 3 55.0 14.0 8.0 12.0
Basket 4 15.0 14.0 8.0 12.0
Basket 5 7.0 1.0 1.0 12.0
Basket 6 7.0 4.0 9.0 2.0
Table 2c.14
Here, the mode of the Apple column (10, 7, 55, 15, 7) is 7, so its NaN value is
replaced by 7. Similarly, in the Orange column NaNs are replaced with 14, in
the Banana column with 8, and in the Pear column with 12.
Apple Orange Banana Pear
Basket 1 10 NaN 30 40
Basket 2 7 14 21 28
Basket 3 55 NaN 8 12
Basket 4 15 14 NaN 8
Basket 5 7 1 1 NaN
Basket 6 NaN 4 9 2
Table 2c.15
Here, we can see NaN values in all the columns. Let’s fill them with the
minimum value of each column. For this, use the following code:
import pandas as pd
import numpy as np
df = pd.DataFrame([[10, np.nan, 30, 40], [7, 14, 21, 28], [55, np.nan, 8, 12],
                   [15, 14, np.nan, 8], [7, 1, 1, np.nan], [np.nan, 4, 9, 2]],
columns=['Apple', 'Orange', 'Banana', 'Pear'],
index=['Basket1', 'Basket2', 'Basket3', 'Basket4',
'Basket5', 'Basket6'])
df
df.fillna(df.min())
Output:
*****
2c1
ASSESS SUPERSTEP
Unit Structure
2c1.0 Objectives
2c1.1 Engineering a Practical Assess Superstep
2c1.2 Unit End Questions
2c1.3 References
2c1.0 OBJECTIVES
This chapter will make you understand the practical concepts of:
● Assess superstep
NetworkX provides:
● tools for the study of the structure and dynamics of social, biological,
and infrastructure networks;
● a standard programming interface and graph implementation that is
suitable for many applications;
● a rapid development environment for collaborative, multidisciplinary
projects;
● an interface to existing numerical algorithms and code written in C,
C++, and FORTRAN; and the ability to painlessly work with large
nonstandard data sets.
With NetworkX you can load and store networks in standard and
nonstandard data formats, generate many types of random and classic
networks, analyze network structure, build network models, design new
network algorithms, draw networks, and much more.
Graph Theory:
In graph theory, a graph consists of a finite set of vertices (V), also called
nodes, and a finite set of edges (E), each of which connects two vertices.
Each element (e) connecting two vertices, or nodes, is called a link or an
edge. Consider the graph of bike paths below: the sets {K,L}, {F,G},
{J,H}, {H,L}, {A,B}, and {C,E} are examples of edges.
Figure 2c1.1
The total number of edges for each node is the degree of that node.
In the Graph above, M has a degree of 2 ({M,H} and {M,L}) while B has
a degree of 1 ({B,A}). Degree is described formally as:
Figure 2c1.2
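The formal expression referred to above was part of the original figure and is not reproduced in this text; in standard notation (an assumption, not copied from the source figure), the degree of a node v counts the edges incident to it, and the degrees sum to twice the number of edges:

\deg(v) = \left|\{\, e \in E : v \in e \,\}\right|, \qquad \sum_{v \in V} \deg(v) = 2\,|E|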
Neo4j’s book on graph algorithms provides a clear summary (see Figure 2c1.3).
Figure 2c1.3
For example:
# ### Creating a graph
# Create an empty graph with no nodes and no edges.
import networkx as nx
G = nx.Graph()
# By definition, a `Graph` is a collection of nodes (vertices) along with
identified pairs of
# nodes (called # edges, links, etc). In NetworkX, nodes can be any
[hashable] object e.g., a
# text string, an image, an # XML object, another Graph, a customized
node object, etc.
# # Nodes
# The graph `G` can be grown in several ways. NetworkX includes many
graph generator
# functions # and facilities to read and write graphs in many formats.
# To get started # though we’ll look at simple manipulations. You can add
one node at a
# time,
G.add_node(1)
# or add nodes from any [iterable] container, such as a list
G.add_nodes_from([2, 3])
# Nodes from one graph can be incorporated into another:
H = nx.path_graph(10)
G.add_nodes_from(H)
# `G` now contains the nodes of `H` as nodes of `G`.
# In contrast, you could use the graph `H` as a node in `G`.
G.add_node(H)
# The graph `G` now contains `H` as a node. This flexibility is very
powerful as it allows
# graphs of graphs, graphs of files, graphs of functions and much more. It
is worth thinking
# about how to structure # your application so that the nodes are useful
entities. Of course
# you can always use a unique identifier # in `G` and have a separate
dictionary keyed by
# identifier to the node information if you prefer.
# # Edges
# `G` can also be grown by adding one edge at a time,
G.add_edge(1, 2)
e = (2, 3)
G.add_edge(*e) # unpack edge tuple*
# by adding a list of edges,
G.add_edges_from([(1, 2), (1, 3)])
# or by adding any ebunch of edges. An *ebunch* is any iterable container
of edge-tuples.
# An edge-tuple can be a 2-tuple of nodes or a 3-tuple with 2 nodes
followed by an edge
# attribute dictionary, e.g.,
# `(2, 3, {'weight': 3.1415})`. Edge attributes are discussed further below.
G.add_edges_from(H.edges)
# There are no complaints when adding existing nodes or edges.
# For example, after removing all # nodes and edges,
G.clear()
# we add new nodes/edges and NetworkX quietly ignores any that are
already present.
G.add_edges_from([(1, 2), (1, 3)])
G.add_node(1)
G.add_edge(1, 2)
G.add_node("spam") # adds node "spam"
G.add_nodes_from("spam") # adds 4 nodes: 's', 'p', 'a', 'm'
G.add_edge(3, 'm')
# At this stage the graph `G` consists of 8 nodes and 3 edges, as can be
seen by:
G.number_of_nodes()
G.number_of_edges()
# # Examining elements of a graph
# We can examine the nodes and edges. Four basic graph properties
facilitate reporting:
# `G.nodes`,
# `G.edges`, `G.adj` and `G.degree`. These are set-like views of the nodes,
edges, neighbors
edges, neighbors
# (adjacencies), and degrees of nodes in a graph. They offer a continually
updated read-only
#view into the graph structure. They are also dict-like in that you can look
up node and edge
#data attributes via the views and iterate with data attributes using
methods `.items()`,
#`.data('span')`.
# If you want a specific container type instead of a view, you can specify
one.
# Here we use lists, though sets, dicts, tuples and other containers may be
better in other
#contexts.
list(G.nodes)
list(G.edges)
list(G.adj[1])  # or list(G.neighbors(1))
G.degree[1]  # the number of edges incident to 1
# One can specify to report the edges and degree from a subset of all
nodes using an
#nbunch.
# An *nbunch* is any of: `None` (meaning all nodes), a node, or an
iterable container of nodes that is # not itself a node in the graph.
G.edges([2, 'm'])
G.degree([2, 3])
# # Removing elements from a graph
# One can remove nodes and edges from the graph in a similar fashion to
adding.
# Use methods `Graph.remove_node()`, `Graph.remove_nodes_from()`,
#`Graph.remove_edge()`
# and `Graph.remove_edges_from()`, e.g.
G.remove_node(2)
G.remove_nodes_from("spam")
list(G.nodes)
G.remove_edge(1, 3)
# # Using the graph constructors
# Graph objects do not have to be built up incrementally - data specifying
# graph structure can be passed directly to the constructors of the various
graph classes.
# When creating a graph structure by instantiating one of the graph
# classes you can specify data in several formats.
G.add_edge(1, 2)
H = nx.DiGraph(G)  # create a DiGraph using the connections from G
list(H.edges())
edgelist = [(0, 1), (1, 2), (2, 3)]
H = nx.Graph(edgelist)
# # What to use as nodes and edges
# You might notice that nodes and edges are not specified as NetworkX
# objects. This leaves you free to use meaningful items as nodes and
# edges. The most common choices are numbers or strings, but a node can
# be any hashable object (except `None`), and an edge can be associated
# with any object `x` using `G.add_edge(n1, n2, object=x)`.
# As an example, `n1` and `n2` could be protein objects from the RCSB
Protein Data Bank,
#and `x` # could refer to an XML record of publications detailing
experimental observations
#of their interaction.
# We have found this power quite useful, but its abuse can lead to
surprising behavior
#unless one is # familiar with Python.
# If in doubt, consider using `convert_node_labels_to_integers()` to obtain
a more
# traditional graph with integer labels.
# # Accessing edges and neighbors
# In addition to the views `Graph.edges` and `Graph.adj`, access to edges
and neighbors is
and neighbors is
#possible using subscript notation.
G = nx.Graph([(1, 2, {"color": "yellow"})])
G[1]  # same as G.adj[1]
G[1][2]
G.edges[1, 2]
# You can get/set the attributes of an edge using subscript notation
# if the edge already exists
G.add_edge(1, 3)
G[1][3]['color'] = "blue"
G.edges[1, 2]['color'] = "red"
G.edges[1, 2]
# Fast examination of all (node, adjacency) pairs is achieved using
# `G.adjacency()`, or `G.adj.items()`.
# Note that for undirected graphs, adjacency iteration sees each edge
twice.
FG = nx.Graph()
G.graph['day'] = "Monday"
G.graph
# # Node attributes
# Add node attributes using `add_node()`, `add_nodes_from()`, or
`G.nodes`
G.add_node(1, time='5pm')
G.add_nodes_from([3], time='2pm')
G.nodes[1]
G.nodes[1]['room'] = 714
G.nodes.data()
# Note that adding a node to `G.nodes` does not add it to the graph, use
# `G.add_node()` to add new nodes. Similarly for edges.
# # Edge Attributes
# Add/change edge attributes using `add_edge()`, `add_edges_from()`,
# or subscript notation.
G.add_edge(1, 2, weight=4.7 )
G.add_edges_from([(3, 4), (4, 5)], color='red')
G.add_edges_from([(1, 2, {'color': 'blue'}), (2, 3, {'weight': 8})])
G[1][2]['weight'] = 4.7
G.edges[3, 4]['weight'] = 4.2
# The special attribute `weight` should be numeric as it is used by
# algorithms requiring weighted edges.
# Directed graphs
# The `DiGraph` class provides additional methods and properties specific
# to directed edges, e.g.,
# `DiGraph.out_edges`, `DiGraph.in_degree`,
# `DiGraph.successors()`, `DiGraph.predecessors()`, etc.
# To allow algorithms to work with both classes easily, the directed
versions of
# `neighbors()` is equivalent to `successors()` while `degree` reports
# the sum of `in_degree` and `out_degree` even though that may feel
# inconsistent at times.
DG = nx.DiGraph()
DG.add_weighted_edges_from([(1, 2, 0.5), (3, 1, 0.75)])
DG.out_degree(1, weight='weight')
DG.degree(1, weight='weight')
list(DG.successors(1))
list(DG.neighbors(1))
# Some algorithms work only for directed graphs and others are not well
# defined for directed graphs. Indeed the tendency to lump directed
# and undirected graphs together is dangerous. If you want to treat
# a directed graph as undirected for some measurement you should
probably
# convert it using `Graph.to_undirected()` or with
H = nx.Graph(G)  # create an undirected graph H from a directed graph G
# # Multigraphs
# NetworkX provides classes for graphs which allow multiple edges
# between any pair of nodes. The `MultiGraph` and
# `MultiDiGraph`
# classes allow you to add the same edge twice, possibly with different
# edge data. This can be powerful for some applications, but many
# algorithms are not well defined on such graphs.
# Where results are well defined,
# e.g., `MultiGraph.degree()`, we provide the function. Otherwise you
# should convert to a standard graph in a way that makes the measurement
well defined
MG = nx.MultiGraph()
MG.add_weighted_edges_from([(1, 2, 0.5), (1, 2, 0.75), (2, 3, 0.5)])
dict(MG.degree(weight='weight'))
GG = nx.Graph()
for n, nbrs in MG.adjacency():
    for nbr, edict in nbrs.items():
        minvalue = min([d['weight'] for d in edict.values()])
        GG.add_edge(n, nbr, weight=minvalue)
nx.shortest_path(GG, 1, 3)
# # Graph generators and graph operations
# In addition to constructing graphs node-by-node or edge-by-edge, they
# can also be generated by
# 1. Applying classic graph operations, such as:
# 1. Using a call to one of the classic small graphs, e.g.,
# 1. Using a (constructive) generator for a classic graph, e.g.,
# like so:
K_5 = nx.complete_graph(5)
K_3_5 = nx.complete_bipartite_graph(3, 5)
barbell = nx.barbell_graph(10, 10)
lollipop = nx.lollipop_graph(10, 20)
# 1. Using a stochastic graph generator, e.g, like so:
er = nx.erdos_renyi_graph(100, 0.15)
ws = nx.watts_strogatz_graph(30, 3, 0.1)
ba = nx.barabasi_albert_graph(100, 5)
red = nx.random_lobster(100, 0.9, 0.9)
# 1. Reading a graph stored in a file using common graph formats,
# such as edge lists, adjacency lists, GML, GraphML, pickle, LEDA and
others.
nx.write_gml(red, "[Link]")
mygraph = nx.read_gml("[Link]")
# For details on graph formats see Reading and writing graphs
# and for graph generator functions see Graph generators
# # Analyzing graphs
# The structure of `G` can be analyzed using various graph-theoretic functions.
import matplotlib.pyplot as plt
nx.draw_shell(G, nlist=[range(5, 10), range(5)], with_labels=True,
font_weight='bold')
# Note that you may need to issue a Matplotlib plt.show() command
# when drawing to an interactive display.
plt.show()
options = {
'node_color': 'black',
'node_size': 100,
'width': 3,
}
plt.subplot(221)
nx.draw_random(G, **options)
plt.subplot(222)
nx.draw_circular(G, **options)
plt.subplot(223)
nx.draw_spectral(G, **options)
plt.subplot(224)
nx.draw_shell(G, nlist=[range(5,10), range(5)], **options)
# You can find additional options via `draw_networkx()` and
# layouts via `layout`.
# You can use multiple shells with `draw_shell()`.
G = nx.dodecahedral_graph()
shells = [[2, 3, 4, 5, 6], [8, 1, 0, 19, 18, 17, 16, 15, 14, 7], [9, 10, 11, 12,
13]]
nx.draw_shell(G, nlist=shells, **options)
# To save drawings to a file, use, for example
nx.draw(G)
plt.savefig("[Link]")
# writes to the file `[Link]` in the local directory.
Output:
G = nx.petersen_graph()
plt.subplot(121)
nx.draw(G, with_labels=True, font_weight='bold')
plt.subplot(122)
nx.draw_shell(G, nlist=[range(5, 10), range(5)], with_labels=True,
font_weight='bold')
Figure 2c1.4
plt.show()
options = {
'node_color': 'black',
'node_size': 100,
'width': 3,
}
plt.subplot(221)
nx.draw_random(G, **options)
plt.subplot(222)
nx.draw_circular(G, **options)
plt.subplot(223)
nx.draw_spectral(G, **options)
plt.subplot(224)
nx.draw_shell(G, nlist=[range(5, 10), range(5)], **options)
Figure 2c1.5
G = nx.dodecahedral_graph()
shells = [[2, 3, 4, 5, 6], [8, 1, 0, 19, 18, 17, 16, 15, 14, 7], [9, 10, 11, 12,
13]]
nx.draw_shell(G, nlist=shells, **options)
plt.show()
Figure 2c1.6
nx.draw(G)
plt.savefig("[Link]")
Installation:
$ pip install schedule
The schedule.Job class:
● schedule.every(interval=1): Calls every on the default scheduler
instance. Schedules a new periodic job.
● schedule.run_pending() : Calls run pending on the default scheduler
instance. Run all jobs that are scheduled to run.
Parameters:
● interval: A quantity of a certain time unit
● scheduler: The Scheduler instance that this job will register itself
with once it has been fully configured in Job.do().
● run() : Run the job and immediately reschedule it. Returns: The
return value returned by the job_func
● to(latest) : Schedule the job to run at an irregular (randomized)
interval. For example, every(A).to(B).seconds executes the job
function every N seconds such that A <= N <= B.
For example
# Schedule Library imported
import schedule
import time
# Functions setup
def placement():
print("Get ready for Placement at various companies")
def good_luck():
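The remainder of this example appears to be missing here; the following is only a minimal sketch (not the source's code) of how such a script typically continues, with hypothetical message text and illustrative intervals.

import schedule
import time

def placement():
    print("Get ready for Placement at various companies")

def good_luck():
    print("Good luck!")  # hypothetical body

# Register the jobs with the default scheduler (intervals are illustrative).
schedule.every(5).seconds.do(placement)
schedule.every(10).seconds.do(good_luck)

# Keep running so that pending jobs are executed at their scheduled times.
while True:
    schedule.run_pending()
    time.sleep(1)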
UNIT END QUESTIONS
1. Write Python program to create the network routing diagram from the
given data.
2. Write a Python program to build directed acyclic graph.
3. Write a Python program to pick the content for Bill Boards from the
given data.
4. Write a Python program to generate visitors data from the given csv
file.
REFERENCES
● Python for Data Science For Dummies, by Luca Massaron and John Paul
Mueller
● Python for Data Analysis: Data Wrangling with Pandas, NumPy, and
IPython, 2nd Edition, by William McKinney, ISBN-13: 978-9352136414,
Shroff/O'Reilly
*****
MODULE 2
Unit 3
3a
PROCESS SUPERSTEP
Unit Structure
3a.0 Objectives
3a.1 Introduction
3a.2 Data Vault
3a.2.1 Hubs
3a.2.2 Links
3a.2.3 Satellites
3a.2.4 Reference Satellites
3a.3 Time-Person-Object-Location-Event Data Vault
3a.4 Time Section
3a.4.1 Time Hub
3a.4.2 Time Links
3a.4.3 Time Satellites
3a.5 Person Section
3a.5.1 Person Hub
3a.5.2 Person Links
3a.5.3 Person Satellites
3a.6 Object Section
3a.6.1 Object Hub
3a.6.2 Object Links
3a.6.3 Object Satellites
3a.7 Location Section
3a.7.1 Location Hub
3a.7.2 Location Links
3a.7.3 Location Satellites
3a.8 Event Section
3a.8.1 Event Hub
3a.8.2 Event Links
3a.8.3 Event Satellites
3a.9 Engineering a Practical Process Superstep
3a.9.1 Event
3a.9.2 Explicit Event
3a.9.3 Implicit Event
3a.10 5-Whys Technique
3a.10.1 Benefits of the 5 Whys
3a.10.2 When Are the 5 Whys Most Useful?
3a.10.3 How to Complete the 5 Whys
3a.11 Fishbone Diagrams
3a.12 Monte Carlo Simulation
3a.13 Causal Loop Diagrams
3a.14 Pareto Chart
3a.15 Correlation Analysis
3a.16 Forecasting
3a.17 Data Science
3a.0 OBJECTIVES
The objective of this chapter is to learn the Time-Person-Object-Location-
Event (T-P-O-L-E) design principle and the various concepts that are used
to create and define relationships among these data.
3a.1 INTRODUCTION
The Process superstep converts the assessed results of the retrieved versions
of the data sources into highly structured data vaults. These data vaults form
the basic data structure for the rest of the data science steps.
The Process superstep is the amalgamation procedure that pipes your data
sources into five primary classifications of data.
3a.2 DATA VAULT
Data vault modelling is a technique to manage the long-term storage of data
from multiple operational systems. It stores historical data in the database.
3a.2.1 Hubs:
A data vault hub is used to store business keys. These keys do not change
over time. A hub also contains a surrogate key for each hub entry and
metadata information for the business key.
3a.2.2 Links:
Data vault links are join relationships between business keys.
3a.2.3 Satellites:
Data vault satellites store the chronological, descriptive characteristics for a
specific section of business data. Using hubs and links, we get the model
structure but no chronological characteristics. Satellites consist of
characteristics and metadata linking them to their specific hub.
3a.3 TIME-PERSON-OBJECT-LOCATION-EVENT
DATA VAULT
We will use Time-Person-Object-Location-Event (T-P-O-L-E) design
principle.
All five sections are linked with one another, resulting in sixteen links.
3a.4 TIME SECTION
Time section contain data structure to store all time related information.
For example, time at which event has occurred.
3a.4.1 Time Hub:
This hub acts as a connector between time zones.
Following are the fields of the time hub.
● Time-Person Link
• This link connects date-time values from time hub to person hub.
• Dates such as birthdays, anniversaries, book access date, etc.
● Time-Object Link
• This link connects date-time values from time hub to object hub.
• Dates such as when you buy or sell car, house or book, etc.
● Time-Location Link
• This link connects date-time values from time hub to location hub.
• Dates such as when you moved or access book from post code, etc.
● Time-Event Link
• This link connects date-time values from time hub to event hub.
• Dates such as when you changed vehicles, etc.
A time satellite can be used to move from one time zone to another very
easily. This feature will be used during the Transform superstep.
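For illustration only (not taken from the source), a minimal sketch of how pytz can move one timestamp between time zones; the event time and the target zone are hypothetical.

from datetime import datetime
from pytz import timezone

# Hypothetical event time captured in UTC.
utc_time = timezone('UTC').localize(datetime(2020, 1, 2, 3, 4, 5))

# Move the same instant to another time zone, as the time satellite allows.
local_time = utc_time.astimezone(timezone('Asia/Kolkata'))
print(utc_time.strftime("%Y-%m-%d %H:%M:%S (%Z)"))
print(local_time.strftime("%Y-%m-%d %H:%M:%S (%Z)"))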
3a.5.2 Person Links:
Person Links connect person hub to other hubs.
● Person-Time Link
• This link contains relationship between person hub and time hub.
● Person-Object Link
• This link contains relationship between person hub and object hub.
● Person-Location Link
• This link contains relationship between person hub and location hub.
● Person-Event Link
• This link contains relationship between person hub and event hub.
● Object-Time Link
• This link contains relationship between Object hub and time hub.
● Object-Person Link
• This link contains relationship between Object hub and Person hub.
● Object-Location Link
• This link contains relationship between Object hub and Location
hub.
● Object-Event Link
• This link contains relationship between Object hub and event hub.
3a.7.2 Location Links:
● Location-Time Link
• This link contains relationship between location hub and time hub.
● Location-Person Link
• This link contains relationship between location hub and person hub.
● Location-Object Link
• This link contains relationship between location hub and object hub.
● Location-Event Link
• This link contains relationship between location hub and event hub.
3a.8 EVENT SECTION
It contains the data structures to store all data about entities related to events
that have occurred.
● Event-Time Link
• This link contains relationship between event hub and time hub.
● Event-Person Link
• This link contains relationship between event hub and person hub.
● Event-Object Link
• This link contains relationship between event hub and object hub.
● Event-Location Link
• This link contains relationship between event hub and location hub.
Year:
The standard uses four digits to represent the year. The values range from
0000 to 9999.
AD/BC requires conversion:
Year Conversion
N AD Year N
3 AD Year 3
1 AD Year 1
1 BC Year 0
2 BC Year – 1
2020AD +2020
2020BC -2019 (year -1 for BC)
Table 3a.1
from datetime import datetime
from pytz import timezone, all_timezones
now_date = datetime(2020, 1, 2, 3, 4, 5, 6)
now_utc=now_date.replace(tzinfo=timezone('UTC'))
print('Date:',str(now_utc.strftime("%Y-%m-%d %H:%M:%S (%Z)
(%z)")))
print('Year:',str(now_utc.strftime("%Y")))
Output:
Month:
The standard uses two digits to represent the month. The values range from
01 to 12.
The rule for a valid month: 12 January 2020 becomes 2020-01-12.
Above program can be updated to extract month value.
print('Date:',str(now_utc.strftime("%Y-%m-%d %H:%M:%S (%Z)
(%z)")))
print('Month:',str(now_utc.strftime("%m")))
print('Month Name:',str(now_utc.strftime("%B")))
Output:
Number Name
01 January
02 February
03 March
04 April
05 May
06 June
07 July
08 August
09 September
10 October
11 November
12 December
Table 3a.2
Day
The standard uses two digits to represent the day. The values range from
01 to 31.
The rule for a valid day: 22 January 2020 becomes 2020-01-22 or
+2020-01-22.
print('Date:',str(now_utc.strftime("%Y-%m-%d %H:%M:%S (%Z)
(%z)")))
print('Day:',str(now_utc.strftime("%d")))
Output:
Hour:
The standard uses two digits to represent the hour. The values range from 00
to 24.
The valid format is hhmmss or hh:mm:ss. The shortened format hhmm or
hh:mm is also accepted.
The use of 00:00:00 indicates the beginning of the calendar day. The use of
24:00:00 is only to indicate the end of the calendar day.
print('Date:',str(now_utc.strftime("%Y-%m-%d %H:%M:%S (%Z)
(%z)")))
print('Hour:',str(now_utc.strftime("%H")))
Output:
Minute:
The standard uses two digits to represent the minute. The values range from
00 through 59.
The valid format is hhmmss or hh:mm:ss.
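The code that extracts the minute appears to be missing before the output below; following the pattern of the other components (and assuming the now_utc value defined earlier in this chapter), a minimal sketch:

print('Date:', str(now_utc.strftime("%Y-%m-%d %H:%M:%S (%Z) (%z)")))
print('Minute:', str(now_utc.strftime("%M")))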
Output:
Second:
The standard uses two digits to represent the second. The values range from
00 to 59.
The valid format is hhmmss or hh:mm:ss.
print('Date:',str(now_utc.strftime("%Y-%m-%d %H:%M:%S (%Z)
(%z)")))
print('Second:',str(now_utc.strftime("%S")))
Output:
3a.9.1 Event:
This structure records any specific event or action that is discovered in the
data sources. An event is any action that occurs within the data sources.
Events are recorded using three main data entities: Event Type, Event
Group, and Event Code. The details of each event are recorded as a set of
details against the event code. There are two main types of events.
3a.9.2 Explicit Event:
This type of event is stated in the data source clearly and with full details.
There is clear data to show that the specific action was performed.
Following are examples of explicit events:
• A security card with number 1234 was used to open door A.
• You are reading Chapter 9 of Practical Data Science.
• I bought ten cans of beef curry.
Explicit events are the events that the source systems supply, as these have
direct data that proves that the specific action was performed.
3a.9.3 Implicit Event:
The following are examples of implicit events:
• A security card with number 8884.1 was used to open door X.
• A security card with number 8884.1 was issued to Mr. Vermeulen.
• Room 302 is fitted with a security reader marked door X.
These three events would imply that Mr. Vermeulen entered room 302 as
an event. Not true!
Example:
Problem Statement: Customers are unhappy because they are being
shipped products that don’t meet their specifications.
1. Why are customers being shipped bad products?
• Because manufacturing built the products to a specification that is
different from what the customer and the salesperson agreed to.
2. Why did manufacturing build the products to a different
specification than that of sales?
• Because the salesperson accelerates work on the shop floor by calling
3a.16 FORECASTING
Forecasting is the ability to project a possible future, by looking at
historical data. The data vault enables these types of investigations, owing
to the complete history it collects as it processes the source systems’ data.
You will perform many forecasting projects during your career as a data
scientist and supply answers to such questions as the following:
• What should we buy?
• What should we sell?
• Where will our next business come from?
People want to know what you calculate to determine what is about to
happen.
3a.17 DATA SCIENCE
Data science works best when approved techniques and algorithms are
followed.
After performing various experiments on the data, the results must be
verified, and they must have supporting evidence.
Data science that works follows these steps:
Step 1: It begins with a question.
Step 2: Design a model, select a prototype for the data, and start a virtual
simulation. Some statistical and mathematical solutions can be
added to start a data science model.
All questions must be related to the customer's business, in such a
way that the answers provide insight into the business.
Step 3: Formulate a hypothesis based on the collected observations. Using
the model, process the observations and prove whether the
hypothesis is true or false.
Step 4: Compare the above results with real-world observations and
provide these results to the real-life business.
Step 5: Communicate the progress and intermediate results to customers
and subject experts, and involve them in the whole process to
ensure that they are part of the journey of discovery.
SUMMARY
The Process superstep converts the assessed results of the retrieve process
from the data sources into highly structured data vaults that act as the basic
data structure for the remaining data science steps.
9. Explain the Event section of TPOLE.
10. Explain the different date and time formats. What is leap year?
Explain.
11. What is an event? Explain explicit and implicit events.
12. How to Complete the 5 Whys?
13. What is a fishbone diagram? Explain with example.
14. Explain the significance of Monte Carlo Simulation and Causal Loop
Diagram.
15. What are pareto charts? What information can be obtained from pareto
charts?
16. Explain the use of correlation and forecasting in data science.
17. State and explain the five steps of data science.
REFERENCES
[Link]
[Link]
[Link]
[Link]
[Link]
*****
3b
TRANSFORM SUPERSTEP
Unit Structure
3b.0 Objectives
3b.1 Introduction
3b.2 Dimension Consolidation
3b.3 Sun Model
3b.3.1 Person-to-Time Sun Model
3b.3.2 Person-to-Object Sun Model
3b.3.3 Person-to-Location Sun Model
3b.3.4 Person-to-Event Sun Model
3b.3.5 Sun Model to Transform Step
3b.4 Transforming with Data Science
3b.5 Common Feature Extraction Techniques
3b.5.1 Binning
3b.5.2 Averaging
3b.6 Hypothesis Testing
3b.6.1 T-Test
3b.6.2 Chi-Square Test
3b.7 Overfitting & Underfitting
3b.7.1 Polynomial Features
3b.7.2 Common Data-Fitting Issue
3b.8 Precision-Recall
3b.8.1 Precision-Recall Curve
3b.8.2 Sensitivity & Specificity
3b.8.3 F1-Measure
3b.8.4 Receiver Operating Characteristic (ROC) Analysis Curves
3b.9 Cross-Validation Test
3b.10 Univariate Analysis
3b.11 Bivariate Analysis
3b.12 Multivariate Analysis
3b.13 Linear Regression
3b.13.1 Simple Linear Regression
3b.13.2 RANSAC Linear Regression
3b.13.3 Hough Transform
3b.14 Logistic Regression
3b.14.1 Simple Logistic Regression
3b.14.2 Multinomial Logistic Regression
3b.14.3 Ordinal Logistic Regression
3b.15 Clustering Techniques
3b.15.1 Hierarchical Clustering
3b.15.2 Partitional Clustering
3b.16 ANOVA
3b.17 Decision Trees
3b.0 OBJECTIVES
The objective of this chapter is to learn data transformation techniques,
feature extraction techniques, missing datahandling, and various
techniques to categorise data into suitable groups.
3b.1 INTRODUCTION
The Transform superstep allows us to take data from the data vault and
answer the questions raised by the investigation.
It uses standard data science techniques and methods to attain insight and
knowledge about the data that can then be transformed into actionable
decisions. These results can be explained to non-data scientists.
The Transform Superstep uses the data vault from the process step as its
source data.
The sun model is constructed to show all the characteristics from the two
data vault hub categories you are planning to extract. It explains how you
will create two dimensions and a fact via the Transform step from above
figure. You will create two dimensions (Person and Time) with one fact
(PersonBornAtTime) as shown in below figure,
import sys
import os
import sqlite3 as sq
import pandas as pd
from datetime import datetime
from pytz import timezone
import uuid
pd.options.mode.chained_assignment = None
############################################################
####
if sys.platform == 'linux' or sys.platform == 'Darwin':
    Base = os.path.expanduser('~') + '/VKHCG'
else:
    Base = 'C:/VKHCG'
print('################################')
print('Working Base :', Base, ' using ', sys.platform)
print('################################')
############################################################
####
Company='01-Vermeulen'
############################################################
####
sDataBaseDir=Base + '/' + Company + '/04-Transform/SQLite'
if not os.path.exists(sDataBaseDir):
    os.makedirs(sDataBaseDir)
sDatabaseName=sDataBaseDir + '/[Link]'
conn1 = sq.connect(sDatabaseName)
############################################################
####
sDataWarehousetDir=Base + '/99-DW'
if not os.path.exists(sDataWarehousetDir):
    os.makedirs(sDataWarehousetDir)
sDatabaseName=sDataWarehousetDir + '/[Link]'
conn2 = sq.connect(sDatabaseName)
print('\n#################################')
print('Time Dimension')
BirthZone = 'Atlantic/Reykjavik'
BirthDateUTC = datetime(1960, 12, 20, 10, 15, 0)
BirthDateZoneUTC = BirthDateUTC.replace(tzinfo=timezone('UTC'))
BirthDateZoneStr = BirthDateZoneUTC.strftime("%Y-%m-%d %H:%M:%S")
BirthDateZoneUTCStr = BirthDateZoneUTC.strftime("%Y-%m-%d %H:%M:%S (%Z) (%z)")
BirthDate = BirthDateZoneUTC.astimezone(timezone(BirthZone))
BirthDateStr = BirthDate.strftime("%Y-%m-%d %H:%M:%S (%Z) (%z)")
BirthDateLocal = BirthDate.strftime("%Y-%m-%d %H:%M:%S")
############################################################
####
IDTimeNumber=str(uuid.uuid4())
TimeLine=[('TimeID', [IDTimeNumber]),
('UTCDate', [BirthDateZoneStr]),
('LocalTime', [BirthDateLocal]),
('TimeZone', [BirthZone])]
TimeFrame = pd.DataFrame(dict(TimeLine))
############################################################
####
DimTime=TimeFrame
DimTimeIndex=DimTime.set_index(['TimeID'],inplace=False)
sTable = 'Dim-Time'
print('\n#################################')
print('Storing :',sDatabaseName,'\n Table:',sTable)
print('\n#################################')
DimTimeIndex.to_sql(sTable, conn1, if_exists="replace")
DimTimeIndex.to_sql(sTable, conn2, if_exists="replace")
print('\n#################################')
print('Dimension Person')
print('\n#################################')
FirstName = 'Guðmundur'
LastName = 'Gunnarsson'
############################################################
###
IDPersonNumber=str(uuid.uuid4())
PersonLine=[('PersonID', [IDPersonNumber]),
('FirstName', [FirstName]),
('LastName', [LastName]),
('Zone', ['UTC']),
('DateTimeValue', [BirthDateZoneStr])]
PersonFrame = pd.DataFrame(dict(PersonLine))
############################################################
####
DimPerson=PersonFrame
DimPersonIndex=DimPerson.set_index(['PersonID'],inplace=False)
############################################################
####
sTable = 'Dim-Person'
print('\n#################################')
print('Storing :',sDatabaseName,'\n Table:',sTable)
print('\n#################################')
DimPersonIndex.to_sql(sTable, conn1, if_exists="replace")
DimPersonIndex.to_sql(sTable, conn2, if_exists="replace")
print('\n#################################')
print('Fact - Person - time')
print('\n#################################')
IDFactNumber=str(uuid.uuid4())
PersonTimeLine=[('IDNumber', [IDFactNumber]),
('IDPersonNumber', [IDPersonNumber]),
('IDTimeNumber', [IDTimeNumber])]
PersonTimeFrame = pd.DataFrame(dict(PersonTimeLine))
############################################################
####
FctPersonTime=PersonTimeFrame
FctPersonTimeIndex=FctPersonTime.set_index(['IDNumber'],inplace=Fal
se)
############################################################
####
sTable = 'Fact-Person-Time'
print('\n#################################')
print('Storing:',sDatabaseName,'\n Table:',sTable)
print('\n#################################')
FctPersonTimeIndex.to_sql(sTable, conn1, if_exists="replace")
FctPersonTimeIndex.to_sql(sTable, conn2, if_exists="replace")
3b.5.1 Binning:
Binning technique is used to reduce the complexity of data sets, to enable
the data scientist to evaluate the data with an organized grouping
technique.
Binning is a good way for you to turn continuous data into a data set that
has specific features that you can evaluate for patterns. For example, if
you have data about a group of people, you might want to arrange their
ages into a smaller number of age intervals (for example, grouping every
five years together).
import numpy
data = numpy.random.random(100)
bins = numpy.linspace(0, 1, 10)
digitized = numpy.digitize(data, bins)
bin_means = [data[digitized == i].mean() for i in range(1, len(bins))]
print(bin_means)
# The second approach is to use the histogram function.
bin_means2 = (numpy.histogram(data, bins, weights=data)[0] /
              numpy.histogram(data, bins)[0])
print(bin_means2)
3b.5.2 Averaging:
The use of averaging enables you to reduce the amount of records you
require to report any activity that demands a more indicative, rather than a
precise, total.
Example:
Create a model that enables you to calculate the average position for ten
sample points. First, set up the ecosystem.
import numpy as np
import pandas as pd
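The rest of this averaging example appears to be missing; the following is only a minimal sketch (under the stated assumption of ten random latitude/longitude sample points) of how an average position could be computed.

import numpy as np
import pandas as pd

# Ten hypothetical sample points (latitude, longitude).
np.random.seed(0)
points = pd.DataFrame({
    'Latitude': np.random.uniform(-90, 90, 10),
    'Longitude': np.random.uniform(-180, 180, 10),
})

# The average position reduces the ten records to one indicative value.
mean_position = points.mean()
print(mean_position)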
3b.6.1 T-Test:
The t-test is one of many tests used for the purpose of hypothesis testing in
statistics. A t-test is a popular statistical test to make inferences about
single means or inferences about two means or variances, to check if the
two groups’ means are statistically different from each other, where
n (the sample size) < 30 and the standard deviation is unknown.
The one-sample t-test determines whether the sample mean is statistically
different from a known or hypothesised population mean. The one-sample
t-test is a parametric test.
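A minimal sketch (not from the source) of a one-sample t-test using scipy.stats; the sample values and the hypothesised mean of 50 are hypothetical.

import numpy as np
from scipy import stats

# Hypothetical sample of 12 measurements; test whether the mean differs from 50.
np.random.seed(0)
sample = np.random.normal(loc=52, scale=5, size=12)

t_stat, p_value = stats.ttest_1samp(sample, popmean=50)
print('t-statistic:', t_stat)
print('p-value:', p_value)
# A small p-value (for example, below 0.05) suggests the sample mean differs from 50.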
Example:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import Ridge
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
def f(x):
    """Function to approximate by polynomial interpolation."""
    return x * np.sin(x)

# generate points used to plot
x_plot = np.linspace(0, 10, 100)

# generate points and keep a subset of them
x = np.linspace(0, 10, 100)
rng = np.random.RandomState(0)
rng.shuffle(x)
x = np.sort(x[:20])
y = f(x)
Example:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
def true_fun(X):
    return np.cos(1.5 * np.pi * X)

np.random.seed(0)
n_samples = 30
degrees = [1, 4, 15]
X = np.sort(np.random.rand(n_samples))
y = true_fun(X) + np.random.randn(n_samples) * 0.1
plt.figure(figsize=(14, 5))
for i in range(len(degrees)):
    ax = plt.subplot(1, len(degrees), i + 1)
    plt.setp(ax, xticks=(), yticks=())
    polynomial_features = PolynomialFeatures(degree=degrees[i],
                                             include_bias=False)
    linear_regression = LinearRegression()
    pipeline = Pipeline([("polynomial_features", polynomial_features),
                         ("linear_regression", linear_regression)])
    pipeline.fit(X[:, np.newaxis], y)
    # Evaluate the models using cross-validation
    scores = cross_val_score(pipeline, X[:, np.newaxis], y,
                             scoring="neg_mean_squared_error", cv=10)
    X_test = np.linspace(0, 1, 100)
    plt.plot(X_test, pipeline.predict(X_test[:, np.newaxis]), label="Model")
    plt.plot(X_test, true_fun(X_test), label="True function")
    plt.scatter(X, y, edgecolor='b', s=20, label="Samples")
    plt.xlabel("x")
    plt.ylabel("y")
    plt.xlim((0, 1))
    plt.ylim((-2, 2))
    plt.legend(loc="best")
    plt.title("Degree {}\nMSE = {:.2e}(+/- {:.2e})".format(
        degrees[i], -scores.mean(), scores.std()))
plt.show()
3b.8 PRECISION-RECALL
Precision-recall is a useful measure of prediction success when classes are
extremely imbalanced. In information retrieval,
• Precision is a measure of result relevancy.
• Recall is a measure of how many truly relevant results are returned.
Precision (P) is defined as the number of true positives (Tp) over the
number of true positives (Tp) plus the number of false positives (Fp).
Recall (R) is defined as the number of true positives (Tp) over the
number of true positives (Tp) plus the number of false negatives (Fn).
The true negative rate (TNR) is the rate that indicates the recall of the
negative items.
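Written out in standard notation (a reconstruction; Tn, the number of true negatives, is an assumption not spelled out in the text above):

P = \frac{T_p}{T_p + F_p}, \qquad
R = \frac{T_p}{T_p + F_n}, \qquad
TNR = \frac{T_n}{T_n + F_p}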
3b.8.2 Sensitivity &amp; Specificity:
Sensitivity and specificity are statistical measures of the performance of a
binary classification test, also known in statistics as a classification
function. Sensitivity (also called the true positive rate, the recall, or
probability of detection) measures the proportion of positives that are
correctly identified as such (e.g., the percentage of sick people who are
correctly identified as having the condition). Specificity (also called the
true negative rate) measures the proportion of negatives that are correctly
identified as such (e.g., the percentage of healthy people who are correctly
identified as not having the condition).
3b.8.3 F1-Measure:
The F1-score is a measure that combines precision and recall in the
harmonic mean of precision and recall.
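In standard notation (a reconstruction, not reproduced from the source), the harmonic mean of precision P and recall R is:

F_1 = 2 \cdot \frac{P \cdot R}{P + R}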
3b.9 CROSS-VALIDATION TEST
Cross-validation is a model validation technique for evaluating how the
results of a statistical analysis will generalize to an independent data set. It
is mostly used in settings where the goal is the prediction. Knowing how
to calculate a test such as this enables you to validate the application of
your model on real-world, i.e., independent data sets.
Example:
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn import datasets, svm
import matplotlib.pyplot as plt
digits = datasets.load_digits()
X = digits.data
y = digits.target
Let’s pick three different kernels and compare how they will perform.
kernels = ['linear', 'poly', 'rbf']
for kernel in kernels:
    svc = svm.SVC(kernel=kernel)
    C_s = np.logspace(-15, 0, 15)
    scores = list()
    scores_std = list()
    for C in C_s:
        svc.C = C
        this_scores = cross_val_score(svc, X, y, n_jobs=1)
        scores.append(np.mean(this_scores))
        scores_std.append(np.std(this_scores))
You must plot your results.
    Title = "Kernel:>" + kernel
    fig = plt.figure(1, figsize=(4.2, 6))
    plt.clf()
    plt.title(Title, fontsize=20)
    plt.semilogx(C_s, scores)
    plt.semilogx(C_s, np.array(scores) + np.array(scores_std), 'b--')
    plt.semilogx(C_s, np.array(scores) - np.array(scores_std), 'b--')
    locs, labels = plt.yticks()
    plt.yticks(locs, list(map(lambda x: "%g" % x, locs)))
    plt.ylabel('Cross-Validation Score')
    plt.xlabel('Parameter C')
    plt.ylim(0, 1.1)
    plt.show()
Well done. You can now perform cross-validation of your results.
Table 3b.1
Suppose that the heights of seven students in a class are recorded (in the
above table); there is only one variable, height, and it does not deal with any
cause or relationship.
type of data can be made by drawing conclusions using central tendency
measures (mean, median and mode), dispersion or spread of data (range,
minimum, maximum, quartiles, variance and standard deviation) and by
using frequency distribution tables, histograms, pie charts, frequency
polygon and bar charts.
Table 3b.2
Suppose the temperature and ice cream sales are the two variables of a
bivariate data set (in the above table). Here, the relationship is visible from
the table that temperature and sales are directly proportional to each other
and thus related because as the temperature increases, the sales also
increase. Thus, bivariate data analysis involves comparisons, relationships,
causes and explanations. These variables are often plotted on X and Y axis
on the graph for better understanding of data and one of these variables is
independent while the other is dependent.
• Real estate: A simple linear regression analysis can be used to model
residential home prices as a function of the home's living area. Such a
model helps set or evaluate the list price of a home on the market. The
model could be further improved by including other input variables
such as number of bathrooms, number of bedrooms, lot size, school
district rankings, crime statistics, and property taxes
• Demand forecasting: Businesses and governments can use linear
regression models to predict demand for goods and services. For
example, restaurant chains can appropriately prepare for the predicted
type and quantity of food that customers will consume based upon the
weather, the day of the week, whether an item is offered as a special,
the time of day, and the reservation volume. Similar models can be
built to predict retail sales, emergency room visits, and ambulance
dispatches.
• Medical: A linear regression model can be used to analyze the effect
of a proposed radiation treatment on reducing tumour sizes. Input
variables might include duration of a single radiation treatment,
frequency of radiation treatment, and patient attributes such as age or
weight.
b = slope of the line
Figure 3b.8
3b.13.2 RANSAC Linear Regression:
RANSAC is an acronym for Random Sample Consensus. What this
algorithm does is fit a regression model on a subset of data that the
algorithm judges as inliers while removing outliers. This naturally
improves the fit of the model due to the removal of some data points. An
advantage of RANSAC is its ability to do robust estimation of the model
parameters, i.e., it can estimate the parameters with a high degree of
accuracy, even when a significant number of outliers are present in the
data set. The process will find a solution, because it is so robust.
The process that is used to determine inliers and outliers is described
below.
1. The algorithm randomly selects a random number of samples to be
inliers in the model.
2. All data is used to fit the model, and samples that fall within a certain
tolerance are relabelled as inliers.
3. Model is refitted with the new inliers.
4. Error of the fitted model vs the inliers is calculated.
5. Terminate or go back to step 1 if a certain criterion of iterations or
performance is not met.
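A minimal sketch (not from the source) of robust fitting with scikit-learn's RANSACRegressor on hypothetical data containing a few gross outliers; the default base model is ordinary linear regression.

import numpy as np
from sklearn.linear_model import RANSACRegressor

# Hypothetical data: a linear trend y = 3x + 2 with noise and a few gross outliers.
rng = np.random.RandomState(0)
X = rng.uniform(0, 10, 50).reshape(-1, 1)
y = 3 * X.ravel() + 2 + rng.normal(0, 0.5, 50)
y[:5] += 30   # inject outliers

ransac = RANSACRegressor(random_state=0)  # LinearRegression is the default base model
ransac.fit(X, y)

print('Slope:', ransac.estimator_.coef_[0])
print('Intercept:', ransac.estimator_.intercept_)
print('Inliers found:', ransac.inlier_mask_.sum())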
maxima in a so-called accumulator space that is explicitly constructed by
the algorithm for computing the Hough transform.
With the help of the Hough transformation, this regression improves the
resolution of the RANSAC technique, which is extremely useful when
using robotics and robot vision in which the robot requires the regression
of the changes between two data frames or data sets to move through an
environment.
Simple logistic regression can be used when you have one nominal
variable with two values (male/female, dead/alive, etc.) and one
measurement variable. The nominal variable is the dependent variable,
and the measurement variable is the independent variable. Logistic
Regression, also known as Logit Regression or Logit Model. Logistic
Regression works with binary data, where either the event happens (1) or
the event does not happen (0).
In linear regression modelling, the outcome variable is a continuous
variable. When the outcome variable is categorical in nature, logistic
regression can be used to predict the likelihood of an outcome based on
the input variables. Although logistic regression can be applied to an
outcome variable that represents multiple values, but we will examine the
case in which the outcome variable represents two values such as
true/false, pass/fail, or yes/no.
Simple logistic regression is analogous to linear regression, except that the
dependent variable is nominal, not a measurement. One goal is to see
whether the probability of getting a particular value of the nominal
variable is associated with the measurement variable; the other goal is to
predict the probability of getting a particular value of the nominal variable,
given the measurement variable.
For example, a logistic regression model can be built to determine if a
person will or will not purchase a new automobile in the next 12 months.
The training set could include input variables for a person's age, income,
and gender as well as the age of an existing automobile. The training set
would also include the outcome variable on whether the person purchased
a new automobile over a 12-month period. The logistic regression model
provides the likelihood or probability of a person making a purchase in the
next 12 months.
Logistic regression is based on the logistic function f(y), as given in the
equation below:
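The equation itself appears to have been lost with the original figure; the standard logistic function (a reconstruction, not copied from the source) is:

f(y) = \frac{1}{1 + e^{-y}}, \qquad 0 \le f(y) \le 1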
predict the dependent variable. Multinomial Logistic Regression is the
regression analysis to conduct when the dependent variable is nominal
with more than two levels.
For example, you could use multinomial logistic regression to understand
which type of drink consumers prefer based on location in the UK and age
(i.e., the dependent variable would be "type of drink", with four categories
– Coffee, Soft Drink, Tea and Water – and your independent variables
would be the nominal variable, "location in UK", assessed using three
categories – London, South UK and North UK – and the continuous
variable, "age", measured in years). Alternately, you could use
multinomial logistic regression to understand whether factors such as
employment duration within the firm, total employment duration,
qualifications and gender affect a person's job position (i.e., the dependent
variable would be "job position", with three categories – junior
management, middle management and senior management – and the
independent variables would be the continuous variables, "employment
duration within the firm" and "total employment duration", both measured
in years, the nominal variables, "qualifications", with four categories – no
degree, undergraduate degree, master's degree and PhD – "gender", which
has two categories: "males" and "females").
3b.15.2 Partitional Clustering:
3b.16 ANOVA
The ANOVA test is the initial step in analysing factors that affect a given
data set. Once the test is finished, an analyst performs additional testing on
the methodical factors that measurably contribute to the data set's
inconsistency. The analyst utilizes the ANOVA test results in an f-test to
generate additional data that aligns with the proposed regression models.
The ANOVA test allows a comparison of more than two groups at the
same time to determine whether a relationship exists between them.
Example:
A BOGOF (buy-one-get-one-free) campaign is executed on 5 groups of
100 customers each. Each group is different in terms of its demographic
attributes. We would like to determine whether these five groups respond
differently to the campaign. This would help us optimize the right
campaign for the right demographic group, increase the response rate, and
reduce the cost of the campaign.
The analysis of variance works by comparing the variance between the
groups to that within the group. The core of this technique lies in assessing
whether all the groups are in fact part of one larger population or a
completely different population with different characteristics.
The formula for ANOVA is:
F = MST / MSE
where F is the ANOVA coefficient (the F statistic), MST is the mean sum of squares due to treatment (the between-group variance), and MSE is the mean sum of squares due to error (the within-group variance).
There are two types of ANOVA: one-way (or unidirectional) and two-
way. One-way or two-way refers to the number of independent variables
in your analysis of variance test. A one-way ANOVA evaluates the impact
of a sole factor on a sole response variable. It determines whether all the
samples are the same. The one-way ANOVA is used to determine whether
there are any statistically significant differences between the means of
three or more independent (unrelated) groups.
A two-way ANOVA is an extension of the one-way ANOVA. With a one-
way, you have one independent variable affecting a dependent variable.
With a two-way ANOVA, there are two independents. For example, a
two-way ANOVA allows a company to compare worker productivity
based on two independent variables, such as salary and skill set. It is
utilized to observe the interaction between the two factors and tests the
effect of two factors at the same time.
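A minimal one-way ANOVA sketch for the five-group campaign example above is shown below, assuming SciPy is available; the response values are invented for illustration.

# Hedged sketch: one-way ANOVA comparing the responses of five customer groups.
# The response values are hypothetical.
from scipy import stats

group_a = [12, 15, 14, 10, 13]
group_b = [22, 25, 24, 21, 23]
group_c = [11, 14, 12, 13, 15]
group_d = [30, 28, 32, 29, 31]
group_e = [18, 17, 19, 20, 16]

f_stat, p_value = stats.f_oneway(group_a, group_b, group_c, group_d, group_e)
print(f_stat, p_value)  # a small p-value suggests the groups do not all respond alike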
Decision trees have two varieties: classification trees and regression trees.
Example:
Consider a classification tree that splits first on gender and then on age or income to label customers as people who would purchase the product. In traversing this tree, age does not matter for females, and income does not matter for males.
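A minimal sketch of a classification tree in this spirit is given below (scikit-learn assumed; the feature values and labels are invented for illustration).

# Hedged sketch: a small classification tree predicting product purchase.
# Feature columns: gender (0 = female, 1 = male), age, income (thousands).
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

X = np.array([
    [0, 22, 30], [0, 45, 80], [0, 33, 55],
    [1, 28, 35], [1, 52, 90], [1, 40, 60],
])
y = np.array([1, 0, 1, 0, 1, 1])  # 1 = would purchase the product

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
print(export_text(tree, feature_names=["gender", "age", "income"]))  # the learned splits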
SUMMARY
The Transform superstep allows us to take data from the data vault and
formulate answers to questions raised by the investigations. The
transformation step is the data science process that converts results into
insights.
UNIT END QUESTIONS
10. Explain overfitting and underfitting. Discuss the common fitting issues.
11. Explain precision recall, precision recall curve, sensitivity,
specificity and F1 measure.
12. Explain Univariate Analysis.
13. Explain Bivariate Analysis.
14. What is Linear Regression? Give some common application of linear
regression in the real world.
15. What is Simple Linear Regression? Explain.
16. Write a note on RANSAC Linear Regression.
17. Write a note on Logistic Regression.
18. Write a note on Simple Logistic Regression.
19. Write a note on Multinomial Logistic Regression.
20. Write a note on Ordinal Logistic Regression.
21. Explain clustering techniques.
22. Explain Receiver Operating Characteristic (ROC) Analysis Curves
and cross validation test.
23. Write a note on ANOVA.
24. Write a note on Decision Trees.
*****
Unit 4
Machine Learning for Data Science
4a
TRANSFORM SUPERSTEP
Unit Structure
4a.0 Objectives
4a.1 Introduction
4a.2 Overview
4a.3 Dimension Consolidation
4a.4 The SUN Model
4a.5 Transforming with data science
4a.5.1 Missing value treatment
4a.5.2 Techniques of outlier detection and Treatment
4a.6 Hypothesis testing
4a.7 Chi-square test
4a.8 Univariate Analysis
4a.9 Bivariate Analysis
4a.10 Multivariate Analysis
4a.11 Linear Regression
4a.12 Logistic Regression
4a.13 Clustering Techniques
4a.14 ANOVA
4a.15 Principal Component Analysis (PCA)
4a.16 Decision Trees
4a.17 Support Vector Machines
4a.18 Networks, Clusters, and Grids
4a.19 Data Mining
4a.20 Pattern Recognition
4a.21 Machine Learning
4a.22 Bagging Data
4a.23 Random Forests
4a.24 Computer Vision (CV)
4a.25 Natural Language Processing (NLP)
4a.0 OBJECTIVES
The objective of this chapter is to learn data transformation, which brings
data to knowledge and converts results into insights.
4a.1 INTRODUCTION
The Transform superstep allows us to take data from the data vault and
formulate answers to questions. The transformation step is the data science
process that converts results into meaningful insights.
4a.2 OVERVIEW
The scenario below shows how the data is organised for the Transform superstep.
Data is categorised into 5 different dimensions:
1. Time
2. Person
3. Object
4. Location
5. Event
Figure 4a.1
4a.4 THE SUN MODEL
The use of sun models is a technique that enables the data scientist to
perform consistent dimension consolidation, by explaining the intended
data relationship with the business, without exposing it to the technical
details required to complete the transformation processing.
The sun model is constructed to show all the characteristics from the two
data vault hub categories you are planning to extract. It explains how you
will create two dimensions and a fact via the Transform step.
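A minimal sketch of what this step produces is given below, assuming pandas is available; the dimension, fact, and column names are illustrative assumptions, not taken from the text.

# Hedged sketch: two dimensions and a fact, in the spirit of a sun model.
# The table layouts are hypothetical illustrations.
import pandas as pd

dim_person = pd.DataFrame({
    "PersonKey": [1, 2],
    "FirstName": ["Asha", "Ravi"],
    "LastName": ["Patel", "Kulkarni"],
})

dim_location = pd.DataFrame({
    "LocationKey": [10, 20],
    "City": ["Mumbai", "Pune"],
})

# The fact links the two dimensions and carries a measure.
fct_visits = pd.DataFrame({
    "PersonKey": [1, 2, 1],
    "LocationKey": [10, 20, 20],
    "VisitCount": [3, 1, 2],
})

# Joining the fact back to its dimensions answers business questions.
report = (fct_visits
          .merge(dim_person, on="PersonKey")
          .merge(dim_location, on="LocationKey"))
print(report)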
4a.6 HYPOTHESIS TESTING
Hypothesis testing is not precisely an algorithm, but it is a must-know for
any data scientist. You cannot progress until you have thoroughly mastered
this skill. Hypothesis testing is the process by which statistical tests are
used to check whether a hypothesis is true, using data. Based on
hypothesis testing, data scientists choose to accept or reject the
hypothesis. When an event occurs, it can be a trend or it can happen by chance.
To check whether the event is an important occurrence or just
happenstance, hypothesis testing is necessary.
There are many tests for hypothesis testing, but the following two are the
most popular: the t-test and the chi-square test.
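A minimal sketch of both tests with SciPy follows (SciPy assumed to be available; the sample data is invented for illustration).

# Hedged sketch: a two-sample t-test and a chi-square test of independence.
# The numbers below are hypothetical.
from scipy import stats

# t-test: do two groups have the same mean?
group_a = [5.1, 4.9, 5.4, 5.0, 5.2]
group_b = [5.9, 6.1, 5.8, 6.0, 6.2]
t_stat, p_t = stats.ttest_ind(group_a, group_b)

# chi-square: are two categorical variables independent?
# Rows: gender, columns: purchased / did not purchase.
observed = [[30, 70],
            [45, 55]]
chi2, p_chi, dof, expected = stats.chi2_contingency(observed)

print(p_t, p_chi)  # small p-values argue against the null hypothesis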
Linear regression is a useful tool when the outcome of interest is a continuous
quantity, and logistic regression is a popular method when the outcome is categorical.
4a.13 CLUSTERING TECHNIQUES
A good clustering technique should exhibit the following properties:
● Robustness
● Flexibility
● Efficiency
Clustering algorithms/Methods:
There are several clustering algorithms/Methods available, of which we
will be explaining a few:
● Connectivity Clustering Method: This model is based on the
connectivity between the data points. These models are based on the
notion that the data points closer in data space exhibit more similarity
to each other than the data points lying farther away.
● Clustering Partition Method: This works by dividing the data set into
  partitions. These partitions are predefined, non-empty sets. This
  method is suitable for small datasets.
● Centroid Cluster Method: This model revolves around a centre
  element of the dataset. The data points closest to the centre point (the
  centroid) are considered to form a cluster. The k-means clustering
  algorithm is the best-known example of such a model (see the sketch
  after this list).
● Hierarchical Clustering Method: This method describes a tree-based
  structure of nested clusters. Clusters are formed from divisions and
  their subdivisions in a hierarchy (nested clustering). The hierarchy can
  be pre-determined based upon user choice, and the number of clusters
  can remain dynamic rather than fixed.
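A minimal k-means sketch matching the centroid method described above (scikit-learn assumed; the points are invented for illustration):

# Hedged sketch: k-means clustering of a few 2-D points into two clusters.
# The coordinates are hypothetical.
import numpy as np
from sklearn.cluster import KMeans

points = np.array([[1, 1], [1.5, 2], [1, 1.8],
                   [8, 8], [8.5, 9], [9, 8.2]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(kmeans.labels_)           # cluster assignment of each point
print(kmeans.cluster_centers_)  # the two centroids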
4a.14 ANOVA
ANOVA is an acronym which stands for "ANalysis Of VAriance". An
ANOVA test is a way to find out if survey or experiment results are
significant. In other words, they help you to figure out if you need to reject
the null hypothesis or accept the alternate hypothesis.
Basically, you’re testing groups to see if there’s a difference between
them. Examples of when you might want to test different groups:
● A group of psychiatric patients are trying three different therapies:
counselling, medication and biofeedback. You want to see if one
therapy is better than the others.
● A manufacturer has two different processes to make light bulbs. They
want to know if one process is better than the other.
● Students from different colleges take the same exam. You want to see
  whether one college outperforms the others.
Formula of ANOVA:
F = MST / MSE
Where, F = ANOVA coefficient (the F statistic)
MST = Mean sum of squares due to treatment (between groups)
MSE = Mean sum of squares due to error (within groups)
The ANOVA test is the initial step in analysing factors that affect a given
data set. Once the test is finished, an analyst performs additional testing on
the methodical factors that measurably contribute to the data set's
inconsistency. The analyst utilizes the ANOVA test results in an f-test to
generate additional data that aligns with the proposed regression models.
The ANOVA test allows a comparison of more than two groups at the
same time to determine whether a relationship exists between them. The
result of the ANOVA formula, the F statistic (also called the F-ratio),
allows for the analysis of multiple groups of data to determine the
variability between samples and within samples.
(citation: [Link])
3. Formal or informal?
4. Is it for a special occasion?
5. Which colour suits me better?
6. Which would be the most durable brand?
7. Shall we wait for some special sale or just buy one since its needed?
And similar questions would give us a set of choices to select from. This
prediction works on classification, where the possible outputs are
classified and the possibility of occurrence is decided on the basis of the
probability of that particular output occurring.
Example:
Scene one:
Figure 4a.3
The scene above shows A, B and C as three line segments creating
hyperplanes by dividing the plane. The graph shows two kinds of inputs,
circles and stars, which could come from two different classes. Looking at
the scenario, we can say that A is the line segment dividing the plane into
two half-planes that separate the two input classes.
Scene two:
Figure 4a.4
In scene 2 we can see another rule: the hyperplane that cuts the two
classes into the better halves (leaving the widest margin) is preferred.
Hence, hyperplane C is the best choice of the algorithm.
Scene three:
Figure 4a.5
Here in scene 3, we see one circle overlapping hyperplane A; hence,
according to the rule from scene one, we choose B, which cuts the
coordinates into two better halves.
Scene four:
Figure 4a.6
Scene 4 shows one hyperplane dividing the points into two better halves,
but there exists one extra circle coordinate in the other half-plane. We
call this an outlier, which is generally ignored by the algorithm.
Scene five:
Figure 4a.7
Scene 5 shows another strange scenario, where we have coordinates in all
four quadrants. In this scenario we fold along the x-axis, cut the y-axis
into two halves, and transfer the stars and circles to one side so as to
simplify the solution. The representation is shown below:
Figure 4a.8
This again gives us a chance to divide the two classes into two better halves
using a hyperplane. In the scenario above, we have scooped out the
stars from the circle coordinates and separated them with a different hyperplane.
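A minimal sketch of this idea with scikit-learn follows (scikit-learn assumed; the points are invented): a linear kernel for the separable scenes and an RBF kernel for a non-linear case like scene five.

# Hedged sketch: support vector machines with a linear and a non-linear kernel.
# The coordinates and labels are hypothetical.
import numpy as np
from sklearn.svm import SVC

# Circles (class 0) and stars (class 1) that a straight hyperplane can separate.
X = np.array([[1, 2], [2, 1], [1.5, 1.5],
              [6, 7], [7, 6], [6.5, 6.5]])
y = np.array([0, 0, 0, 1, 1, 1])

linear_svm = SVC(kernel="linear").fit(X, y)
print(linear_svm.predict([[2, 2], [7, 7]]))

# For data like scene five, an RBF kernel lets the algorithm separate classes
# that are not linearly separable in the original space.
rbf_svm = SVC(kernel="rbf", gamma="scale").fit(X, y)
print(rbf_svm.predict([[2, 2], [7, 7]]))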
Neural Networks:
Artificial Neural Networks; the term is quite fascinating when any student
starts learning it. Let us break down the term and know its meaning.
Artificial = means “man made”,
Neural = comes from the term neurons in the brain, a complex structure
of nerve cells which keeps the brain functioning. Neurons are the vital part
of the human brain, handling everything from simple input/output to
complex problem solving.
Network = A connection of two entities (here in our case “Neurons”, not
just two but millions of them).
● There are around 100 billion neurons in our brain, which keep our
  brain and body functioning.
● A human brain can hence store up to 1,000 terabytes of data.
Here, the neuron is actually a processing unit: it calculates the weighted
sum of the input signals to generate the activation signal a, given by:
a = x1·w1 + x2·w2 + … + xn·wn = ∑ xi·wi
where x1, …, xn are the input signals and w1, …, wn are the corresponding weights.
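A minimal sketch of this computation (NumPy assumed; the inputs, weights, and threshold are invented), with a simple step activation applied to the weighted sum:

# Hedged sketch: one artificial neuron computing a weighted sum of its inputs.
# The inputs, weights, and threshold are hypothetical.
import numpy as np

x = np.array([0.5, 0.3, 0.9])      # input signals x1..xn
w = np.array([0.4, 0.7, 0.2])      # weights w1..wn

a = np.dot(x, w)                   # activation signal a = sum(xi * wi)
output = 1 if a > 0.5 else 0       # simple threshold (step) activation
print(a, output)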
4a.18 TENSORFLOW
TensorFlow is an end-to-end open source platform for machine learning. It
has a comprehensive, flexible ecosystem of tools, libraries and community
resources that lets researchers push the state-of-the-art in ML and
developers easily build and deploy ML powered applications.
It is an open source artificial intelligence library, using data flow graphs to
build models. It allows developers to create large-scale neural networks
with many layers. TensorFlow is mainly used for: Classification,
Perception, Understanding, Discovering, Prediction and Creation.
More can be learnt from [Link]
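A minimal sketch of building and compiling a small model with TensorFlow's Keras API is shown below (TensorFlow is assumed to be installed; the layer sizes and input dimension are arbitrary illustrations, not prescribed by the text).

# Hedged sketch: a tiny feed-forward network defined with tf.keras.
# Layer sizes and the input dimension are hypothetical.
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(4,)),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(8, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # binary classification output
])

model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()  # prints the layers and parameter counts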
*****
4b
ORGANIZE AND REPORT SUPERSTEPS
Unit Structure
4b.1 Organize Superstep
4b.2 Report Superstep
4b.3 Graphics, Pictures
4b.4 Unit End Questions
Organize Superstep, Report Superstep, Graphics, Pictures, Showing the
Difference
(citation: from the book: Practical Data Science by Andreas François
Vermeulen)
Horizontal Style:
Performing horizontal-style slicing or subsetting of the data warehouse is
achieved by applying a filter technique that forces the data warehouse to
show only the data for a specific preselected set of filtered outcomes
against the data population. The horizontal-style slicing selects the subset
of rows from the population while preserving the columns.
That is, the data science tool can see the complete record for the records in
the subset of records.
Vertical Style:
Performing vertical-style slicing or subsetting of the data warehouse is
achieved by applying a filter technique that forces the data warehouse to
show only the data for specific preselected filtered outcomes against the
data population. The vertical-style slicing selects the subset of columns
from the population, while preserving the rows.
That is, the data science tool can see only the preselected columns from a
record for all the records in the population.
Island Style:
Performing island-style slicing or subsetting of the data warehouse is
achieved by applying a combination of horizontal- and vertical-style
slicing. This generates a subset of specific rows and specific columns
reduced at the same time.
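A minimal pandas sketch of the three slicing styles follows (pandas assumed; the table and column names are illustrative assumptions).

# Hedged sketch: horizontal, vertical, and island-style slicing of a table.
# The data warehouse table here is a hypothetical illustration.
import pandas as pd

warehouse = pd.DataFrame({
    "Customer": ["Asha", "Ravi", "Meena", "John"],
    "City": ["Mumbai", "Pune", "Mumbai", "Delhi"],
    "Sales": [120, 80, 200, 150],
    "Year": [2023, 2023, 2024, 2024],
})

# Horizontal style: a subset of rows, all columns preserved.
horizontal = warehouse[warehouse["City"] == "Mumbai"]

# Vertical style: a subset of columns, all rows preserved.
vertical = warehouse[["Customer", "Sales"]]

# Island style: a subset of rows and columns at the same time.
island = warehouse.loc[warehouse["Year"] == 2024, ["Customer", "Sales"]]

print(horizontal, vertical, island, sep="\n\n")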
Association Rule Mining:
The Apriori algorithm performs frequent item set mining and association
rule learning over the content of the data lake. It proceeds by identifying
the frequent individual items in the data lake and extends them to larger
and larger item sets, as long as those item sets appear sufficiently
frequently in the data lake.
The frequent item sets determined by Apriori can be used to derive
association rules that highlight common trends in the overall data lake.
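A minimal sketch using the mlxtend library is given below; mlxtend is an assumption on our part (the text does not prescribe a tool), and the basket data is invented.

# Hedged sketch: frequent item sets and association rules with Apriori.
# The transactions are hypothetical; mlxtend is assumed to be installed.
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

# One-hot encoded basket data: rows are transactions, columns are items.
baskets = pd.DataFrame({
    "bread":  [1, 1, 0, 1, 1],
    "butter": [1, 1, 0, 0, 1],
    "milk":   [0, 1, 1, 1, 1],
}, dtype=bool)

frequent_itemsets = apriori(baskets, min_support=0.4, use_colnames=True)
rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.7)
print(rules[["antecedents", "consequents", "support", "confidence"]])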
Appropriate Visualization:
It is true that a picture tells a thousand words. But in data science, you
only want your visualizations to tell one story: the findings of the data
science you prepared. It is absolutely necessary to ensure that your
audience gets your most important message clearly and without any
unintended interpretations.
Practice with your visual tools and achieve a high level of proficiency. I
have seen numerous data scientists lose the value of great data science
results because they did not perform an appropriate visual presentation.
Eliminate Clutter:
Have you ever attended a presentation where the person has painstakingly
prepared 50 slides to present his data science results? The most painful
image is the faces of the people suffering through such a presentation for
over two hours.
The biggest task of a data scientist is to eliminate clutter in the data sets.
There are various algorithms, such as principal component analysis
(PCA), multicollinearity using the variance inflation factor to eliminate
dimensions and impute or eliminate missing values, decision trees to
subdivide, and backward feature elimination, but the biggest contributor to
eliminating clutter is good and solid feature engineering.
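A minimal sketch of one clutter-reduction technique mentioned above, principal component analysis, is shown below (scikit-learn assumed; the feature values are invented).

# Hedged sketch: reducing a 4-column data set to 2 principal components.
# The feature values are hypothetical.
import numpy as np
from sklearn.decomposition import PCA

X = np.array([
    [2.5, 2.4, 0.5, 1.2],
    [0.5, 0.7, 2.2, 0.9],
    [2.2, 2.9, 0.4, 1.1],
    [1.9, 2.2, 0.6, 1.0],
    [3.1, 3.0, 0.3, 1.3],
])

pca = PCA(n_components=2)
reduced = pca.fit_transform(X)
print(reduced.shape)                  # (5, 2): fewer dimensions, less clutter
print(pca.explained_variance_ratio_)  # how much variance each component keeps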