Data Science
(SEWA): SKILL ENHANCEMENT, EMPLOYABILITY, WISDOM, ACCESSIBILITY
SELF-INSTRUCTIONAL STUDY MATERIAL FOR JGND PSOU
B.Sc. (Data Science)
Semester I
BSDB31103T: Introduction to Data Science
The University reserves all the copyrights of the study material. No part of this
publication may be reproduced or transmitted in any form.
Sr. No.  Name
1.  Dr. Preety Dubey
2.  Dr. Chetan Dudhagara
3.  Mr. Sandeep Kumar
JAGAT GURU NANAK DEV PUNJAB STATE OPEN UNIVERSITY, PATIALA
(Established by Act No. 19 of 2019 of the Legislature of State of Punjab)
PREFACE
Jagat Guru Nanak Dev Punjab State Open University, Patiala was established in December
2019 by Act 19 of the Legislature of State of Punjab. It is the first and only Open University
of the State, entrusted with the responsibility of making higher education accessible to all,
especially to those sections of society who do not have the means, time or opportunity to
pursue regular education.
In keeping with the nature of an Open University, this University provides a flexible
education system to suit every need. The time given to complete a programme is double the
duration of a regular mode programme. Well-designed study material has been prepared in
consultation with experts in their respective fields.
The University offers programmes which have been designed to provide relevant, skill-based
and employability-enhancing education. The study material provided in this booklet is
self-instructional, with self-assessment exercises and recommendations for further reading.
The syllabus has been divided into sections and provided as units for simplification.
The course aims to provide a strong foundation in data science and its related application
areas, to build an understanding of the underlying core concepts and emerging technologies in
data science, and to introduce data analysis techniques for applications handling large data.
Unit I: Data Science - a Discipline, Landscape - Data to Data Science, Data Growth - issues
and challenges, data science process, foundations of data science, messy data, anomalies and
artefacts in datasets, cleaning data.
Unit II: Data Acquisition and Processing: Introduction, structured vs. unstructured data,
data preprocessing techniques including data cleaning, selection, integration, transformation
and reduction, data mining, interpretation.
Unit III: Representation of Data: Special types-acoustic, image, sensor and network data.
Problems when handling large data – General techniques for handling large data, Distributing
data storage and processing with Frameworks
Unit IV: Data Science Ethics – Doing good data science – Owners of the data - Valuing
different aspects of privacy - Getting informed consent - The Five Cs – Diversity – Inclusion
– Future Trends.
Section B
Unit V: Data Wrangling Combining and Merging Data Sets – Reshaping and Pivoting –
Data Transformation – String manipulations – Regular Expressions
Unit VI: Data Aggregation and Group Operations Group By Mechanics – Data
Aggregation – GroupWise Operations – Transformations – Pivot Tables – Cross Tabulations
– Date and Time data types.
Unit VII: Data Modeling: Basics of Generative modeling and Predictive modeling.
Charts - histograms, scatter plots, time series plots etc. Graphs, 3D Visualization and
Presentation.
Unit VIII: Applications of Data Science: Business, Insurance, Energy, Health care,
Biotechnology, Manufacturing, Utilities, Telecommunication, Travel, Governance, Gaming,
Pharmaceuticals, Geospatial analytics and modeling
Suggested Readings
1. Sinan Ozdemir, Principles of Data Science, Packt Publishing, 2016
2. Joel Grus: Data Science from Scratch, O’Reilly, 2016
3. Foster Provost & Tom Fawcett: Data Science for Business O’Reilly, 2013
4. Roger D. Peng & Elizabeth Matsui: The Art of Data Science, Lean Publishing, 2015
5. Peter Bruce, Andrew Bruce & Peter Gedeck: Practical Statistics for Data Scientists, 2e:
50+ Essential Concepts Using R and Python, O'Reilly
STRUCTURE
1.0 Objectives
1.11 Summary
1.0 OBJECTIVES
1.1 DATA SCIENCE-A DISCIPLINE
Data Science is a blend of various tools, algorithms and machine learning principles with the
goal of discovering hidden patterns from raw data. It is related to data mining, machine
learning and Big Data. It is an area that manages, manipulates, extracts, and interprets
knowledge from tremendous amounts of data. Data science (DS) is thus a multidisciplinary
field of study whose goal is to address the challenges posed by big data; indeed, it is Big
Data that has given rise to Data Science.
Following are some main reasons for using data science:
• Data Science helps convert huge amounts of raw and unstructured data into
meaningful and useful information.
• Major companies like Google, Amazon and Netflix use data science to handle huge
volumes of data and to improve the customer experience by analysing and making
predictions about trends and customer interests.
• Data science is used in almost every domain, which increases the demand for data
science experts in each domain and thereby increases employability.
At the same time, the growth of data brings challenges such as data quality, storage, and a
shortage of data scientists and data analysts. Big Data is commonly characterized by five V's,
namely volume, variety, velocity, veracity and value; variability is often discussed as an
additional characteristic. Consequently, the challenges these characteristics bring are seen in
data capture, curation, storage, search, sharing, transfer and visualization. Each of these
characteristics of big data is a challenge in itself. The following section discusses the various
characteristics of big data:
i. Volume refers to the enormous size of data. Big data refers to data volumes in the
range of exabytes and beyond; for example, in the year 2016 the estimated global mobile
traffic was 6.2 exabytes (6.2 billion GB) per month, and it was estimated that by the year
2020 there would be almost 40,000 exabytes of data. One of the most pressing challenges of
Big Data is storing all these huge data sets properly. The amount of data being stored in the
data centers and databases of companies is increasing rapidly, and as these data sets grow
exponentially with time, they become extremely difficult to handle. Most of the data is
unstructured and comes from documents, videos, audio, text files and other sources. Such
volumes exceed the capacity of current online storage and processing systems. Traditional
database systems are not able to capture, store and analyse this large amount of data.
Therefore, as the volume of data increases, storage becomes an issue. To handle large
volumes of data, techniques like compression, tiering and deduplication are used. Compression
reduces the number of bits in the data, thus reducing its overall size. Deduplication is the
process of removing duplicate and unwanted data from a data set. Storage solutions such as
the Hadoop framework help, but volume remains a long-term challenge that requires research
and new paradigms.
ii. Velocity refers to the high speed of accumulation of data. There is a massive and
continuous flow of data from sources like machines, networks, social media, mobile phones,
etc. Velocity determines how fast data is generated and how quickly it must be processed to
meet demand; for example, more than 3.5 billion searches are made on Google every day,
and Facebook users are increasing by approximately 22% year on year. Data is streaming in
at unprecedented speed and must be dealt with in a timely manner. RFID tags, sensors and
smart metering are driving the need to deal with torrents of data in near-real time. The data
has to be available at the right time to make business decisions accurately. Reacting quickly
enough to deal with data velocity is a challenge for most organizations.
iii. Variety refers to the nature of data. Data is available in varied formats and from
heterogeneous sources. It can be structured, semi-structured or unstructured. It is a challenge
to find ways of governing, merging and managing these diverse forms of data.
• Structured Data: This kind of data is organized and follows a standard
format, for example database records, health records, bank statements,
stock information and geolocations.
• Semi-structured Data: This kind of data is partially organized, i.e. it has
no formal structure but has features of both structured and unstructured
data. It has a flexible structure, which means that data of any structure
can be captured without making changes to the database schema or
code, and the functionality of the database is not affected by adding or
removing data. While semi-structured data increases flexibility, the lack
of a fixed schema also creates challenges in storage and indexing, and
these challenges increase as the volume of such data grows. Log files
are categorized as semi-structured data; other examples include emails,
zipped files and webpages.
• Unstructured Data: This kind of data does not map to any standard and
cannot be fit into standard fields, e.g. the data on social media
platforms, which may contain both images and videos. Other examples
of unstructured data are photos, video and audio files, social media
content, satellite imagery, presentations, PDFs, open-ended survey
responses, websites, data from IoT devices, mobile data, weather data,
and conversation transcripts.
iv. Veracity means the quality of data, i.e. the extent to which the data being used to
analyse a problem is meaningful. In other words, veracity describes how accurate the data is
and whether it is trustworthy enough to be used for analysis. It refers to inconsistencies and
uncertainty in data. The available data may be messy, so controlling its quality and accuracy
becomes another challenge. The analysis of a problem is totally dependent on the collected
data: if the gathered data is inaccurate, incomplete or has missing values, it may not be able
to provide real or valuable insight into the problem. Veracity therefore refers to the level of
trust in the collected data. It is a big challenge because the accuracy of the analysis depends
on the accuracy of the collected data, and hence it is very important to clean the data to
obtain an accurate analysis.
v. Variability: Variability means the extent to which the data varies, in other words the
variation in data. In addition to the increasing velocities and varieties of data, data flows can
be highly inconsistent, with periodic peaks. Big Data is also variable because of the multitude
of data dimensions resulting from multiple disparate data types and sources; for example,
data in bulk could create confusion, whereas too little data could convey half or incomplete
information. The variation in data should be kept to a minimum, i.e. the difference between
the real and the ideal situation must be minimal. Variability of data can be challenging to
manage.
vi. Value: Bulk data that has no value is of no good to the company unless it is turned
into something useful. Data in itself is of no use or importance; it needs to be converted
into something valuable from which information can be extracted. The challenge is the
ability to transform this bulk data into something of use to the business, i.e. to monetize it.
It is also a challenge to connect and correlate the relationships, hierarchies and multiple
data linkages within big data.
Among all the other concerns, the biggest challenge in data science is the security issue.
The most common data security issue is data theft, especially for organizations that maintain
sensitive data such as financial information or customers' personal information. With the
exponential increase in the usage of the internet, enormous data is available; hence, data
security is of utmost importance. Companies therefore need to follow the three fundamentals
of data security, namely Confidentiality, Integrity and Availability.
1. Statistics: To analyse the large amount of numerical data and to find the meaningful
insights from it, knowledge of statistics is required. Math and statistics expertise is a
fundamental component of data science that allows practitioners to find meaningful patterns
in data that yield actionable insights.
5. Advanced computing: Advanced computing involves designing, writing, debugging,
and maintaining the source code of computer programs. Machine Learning and Deep learning
techniques are required for modelling and to make predictions about unforeseen/future data.
• Data Analysis tools: R, Python, Statistics, SAS, Jupyter, R Studio, MATLAB, Excel,
RapidMiner.
• Data Warehousing: ETL, SQL, Hadoop, Informatica/Talend, AWS Redshift.
• Data Visualization tools: R, Jupyter, Tableau, Cognos.
• Machine learning tools: Spark, Mahout, Azure ML studio.
Data Science has created a strong foothold in several industries. Some areas, though not
limited to these, where data science has been applied and has shown tremendous results are
the pharmaceutical industry, manufacturing, production, scientific studies, education and
travel.
A typical data science project involves stages such as setting the research goal, data
acquisition, data preparation, model building and model deployment. This structured approach
to data science helps in maximizing the chances of success of a data science project at the
lowest cost. The data science process typically consists of the following steps:
ii. Creating a Project Charter: A project charter holds the information gathered while
setting the research goal. The project charter must contain a clear research goal, the project
mission and context, the method of analysis, information about the resources required for
project completion, proof that the project is achievable (or a proof of concept), deliverables,
a measure of success, and a timeline for the project. This information is useful for estimating
the project costs and the data and people required for the project to become a success.
II. Data Acquisition or Data Collection or Data Retrieval: This phase involves gathering
data from various sources like web servers, logs, databases, APIs and online repositories. It
involves acquiring data from all the identified internal and external sources; one should start
with data collection from internal sources. The data may be available in many forms, ranging
from simple text to database records, and may be structured as well as unstructured. Data
may also be acquired from sources outside the organization; some sources may be available
free of cost while others may be paid. Data collection is a tiresome and time-consuming task.
III. Data Preparation: Data collection is an error-prone process; in this phase, the quality
of the data is enhanced and it is prepared for use in subsequent steps. This phase consists of
three subphases: data cleansing, data integration and data transformation.
i) Data Cleansing involves cleansing the data collected in the previous phase. It
involves removing false values from a data source, removing inconsistencies across data
sources, checking misspelled attributes, missing and duplicate values and typing errors,
fixing capital-letter mismatches since most programming languages are case sensitive
(e.g. India and india), removing redundant white spaces, performing sanity checks (checking
for impossible values such as six-digit phone numbers) and handling outliers (an outlier is an
observation that seems distant from the other observations). It should be ensured that most
errors are corrected in this phase to make the data usable and get better results (a brief
pandas sketch of some of these checks follows).
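A minimal sketch of some of these cleansing checks, assuming the pandas library and a
hypothetical DataFrame with 'country' and 'phone' columns:

import pandas as pd

df = pd.DataFrame({"country": [" India", "india", "Canada"],
                   "phone": ["9876543210", "123456", "9123456789"]})

# Remove redundant white spaces and fix capital-letter mismatches (India vs india)
df["country"] = df["country"].str.strip().str.title()

# Remove duplicate records
df = df.drop_duplicates()

# Sanity check: flag impossible values such as six-digit phone numbers
df["valid_phone"] = df["phone"].str.len() == 10
print(df)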
ii) Data Integration enriches data sources by combining information from
multiple data sources. Data comes from several sources and in this sub step the
focus is on integrating these different sources. The data can be combined in the
following ways:
• Joining Tables: To combine information about some data in one table with the
information available in another table; e.g. one table may contain information about the
purchase of a particular product and another table may contain information about the people
who have purchased that product. This can be done using the JOIN command in SQL (a
pandas sketch of joining and appending is given after this list).
• Appending or Stacking: To add observations from one table to another table;
e.g. one table may contain information about the purchase of a particular product in the year
2000 and another may contain similar data for the year 2001, in which case appending means
adding the records of 2001 to the table containing the records of 2000. This can be done
using the UNION operator in SQL.
• View: A view in SQL can be used to virtually combine two tables. This saves
on the space requirement.
• Aggregate Functions: Aggregate functions may be used as required for
combining data.
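The joining and appending operations described above can also be performed with pandas; a
minimal sketch using hypothetical purchase and customer tables:

import pandas as pd

purchases = pd.DataFrame({"customer_id": [1, 2], "product": ["A", "B"]})
customers = pd.DataFrame({"customer_id": [1, 2], "name": ["Asha", "Ravi"]})

# Joining tables: combine purchase records with information about the buyers
joined = purchases.merge(customers, on="customer_id", how="inner")

# Appending/stacking: add the year-2001 records below the year-2000 records
sales_2000 = pd.DataFrame({"product": ["A", "B"], "year": [2000, 2000]})
sales_2001 = pd.DataFrame({"product": ["A", "C"], "year": [2001, 2001]})
appended = pd.concat([sales_2000, sales_2001], ignore_index=True)

print(joined)
print(appended)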
Model Building: This phase involves identifying the data model that best fits the business
requirement. For this, various machine learning techniques like KNN, Naïve Bayes, decision
trees, etc. may be applied to the data. The model is trained on the data set and the selected
model is then tested. Data modelling can be done using Python, R or SAS. Building a model
mainly involves selecting a modelling technique and variables, training (executing) the model,
and then validating it.
VI. Presentation and Automation: After the data has been analysed and a data model built
for it, the results are presented in this stage. These results can take many forms, ranging from
presentations to research reports.
VII. Deployment and Maintenance: Before the final deployment in the production
environment, testing is done in a pre-production environment. After deployment, the reports
are used for real-time analysis.
The data science life cycle is an iterative process, and there is often a need to step back and
rework certain findings. If step 1 (business understanding, i.e. setting the research goal) is
performed diligently, rework can be prevented. Figure 1 below depicts the data science
process.
Fig. 1: The data science process (source: https://livebook.manning.com/book/introducing-data-science/chapter-2/8)
1.8.1 Types of Messy Data
i. Duplicate Data: A record that shares data with another record in the marketing
database is called duplicate data. The database may contain a record that is a total replica of
another record, or it may be a partial duplicate. Partial duplicate records are records that may
contain the same name, phone number, email or address as another record but also contain
other, non-matching data. Sometimes partial duplicates are created by human error, when a
customer or a member of your team enters information manually. As duplicate fields increase,
the cleansing effort grows exponentially. Duplicate data is a big problem in a data-driven
organization, and it is very important to clean any data and get rid of duplicates.
iii. Incomplete Data: Incomplete data is another very commonly occurring kind of dirty
data. A record that lacks key fields on master data records, such as industry type, title or
last name, which are useful for business, is considered incomplete. For example, imagine
trying to sell geolocation software to a prospect whose location is recorded as "N/A".
campaign list. Inconsistent data hinders analytics and makes segmentation difficult
when you have to consider all variables of the same title, industry, etc.
1.9.1 Anomaly: An anomaly is a deviation in data from the expected value of a metric at a
given point in time. An unexpected change within these data patterns, or an event that does
not conform to the expected data pattern, is considered an anomaly. Anomaly detection is any
process that finds the outliers of a dataset (items that do not belong to the dataset); the term
anomaly is also referred to as an outlier. Outliers are data objects that stand out among the
other objects in the data set and do not conform to the normal behaviour of the data set.
Anomaly detection is a technique for finding an unusual point or pattern in a given set, and it
is a data science application that combines multiple data science tasks like classification,
regression and clustering. There are three kinds of anomalies, namely point anomalies,
contextual anomalies and collective anomalies.
• Point Anomaly: If a single instance in a given dataset is different from the others with
respect to its attributes, it is called a point anomaly, i.e. a single instance of data is anomalous
because it deviates largely from the rest of the set, e.g. detecting credit card fraud based on
"amount spent" (a small numerical sketch of flagging such points is given after the list of
applications below).
• Contextual Anomaly: If the data is anomalous only in some context, it is called a
contextual anomaly. This type of anomaly is common in time-series data; in the absence of
the context, all the data points look normal. For example, if temperature is being recorded in
December, an unusually high temperature reading in that month is an abnormal phenomenon
in that context.
• Collective Anomalies are formed by a combination of many instances, i.e. a set of
data instances collectively helps in detecting the anomaly. For example, sequence data in a
network log, or an unexpected attempt to copy data from a remote machine to a local host,
is an anomaly that would be flagged as a potential cyberattack.
Anomaly detection refers to the problem of finding patterns in data that do not conform to
expected behaviour. It is a technique for finding an unusual point or pattern in a given set.
These nonconforming patterns are often referred to as anomalies, outliers, discordant
observations, exceptions, aberrations, surprises, peculiarities, or contaminants in different
application domains. Anomaly detection is commonly used for:
• Data cleaning
• Intrusion detection
• Fraud detection
• Systems health monitoring
• Event detection in sensor networks
• Ecosystem disturbances
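For point anomalies such as an unusual "amount spent", a simple approach is the Z-score
rule; a minimal sketch in Python (the data and the cutoff of 2.5 are illustrative assumptions):

import numpy as np

# Hypothetical transaction amounts; the last one deviates largely from the rest
amounts = np.array([120, 95, 130, 110, 105, 98, 112, 125, 101, 5000])

z_scores = (amounts - amounts.mean()) / amounts.std()
outliers = amounts[np.abs(z_scores) > 2.5]   # flag points far from the mean
print(outliers)                              # [5000]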
1.9.2 Artifact: An artifact is a data flaw caused by equipment, techniques or conditions.
Common sources of data flaws include hardware or software errors, conditions such as
electromagnetic interference, and flawed designs such as an algorithm prone to
miscalculations. Some common data artifacts are:
• Digital Artifacts: Flaws or errors in digital media, documents and data records
caused by data processing, e.g. a distorted camera recording.
• Visual Artifacts: Flaws in visualizations such as user interfaces.
• Compression Artifacts: Flaws in data due to lossy compression.
• Statistical Artifacts: Flaw such as a bias in statistical data.
• Sonic Artifacts: Unwanted sound in a recording.
Other common forms of messy data include:
• missing data
• unstructured data
• multiple variables in one column
• variables stored in the wrong places
• observations split incorrectly or left together against normalization rules
• switched columns and rows
• extra spaces
• It's time-consuming: Organizations hire skilled data scientists only to have them spend
the majority of their time cleaning data. Programming languages are effective at wrangling
big, complex data, but they limit data preparation to a small group of people, which creates
a big bottleneck.
• It's error-prone: Datasets need to be prepared effectively as well as efficiently. Errors
may remain until the work is reviewed by another person, and it is often difficult (or nearly
impossible) for others to revise someone else's data cleaning work.
Avoid Typos
Typos are a result of human error. They are essential to fix because models treat different
values differently, and strings rely heavily on their spelling and case. Another error similar to
typos concerns string size: strings might need to be padded to maintain a format. For example,
a dataset might require 5-digit numbers only, so a value with only four digits such as '1234'
must have a zero added at the beginning to become '01234'. A further problem with strings
is white space; extra white spaces should be removed.
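A minimal sketch in plain Python of the string fixes described above (stripping white space
and padding to the 5-digit format used in the example):

codes = [" 1234", "01234 ", "5678"]

cleaned = []
for value in codes:
    value = value.strip()    # remove extra white spaces
    value = value.zfill(5)   # pad with leading zeros to keep the 5-digit format
    cleaned.append(value)

print(cleaned)   # ['01234', '01234', '05678']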
If you cannot convert a specific data value, you should enter an 'NA' value or something of
this sort.
Imputing Missing Values: Imputation means estimating an approximate value for the missing
entry. Methods like linear regression or the median can be used to calculate the missing value.
Another method of imputing missing values is to copy the value from a similar record; this
method is called 'hot-deck imputation'.
Highlighting Missing Values: Imputation isn't always the best way to take care of missing
values. Many experts argue that it only leads to more mixed results, since the imputed values
are not 'real'. An alternative approach is to inform the model that data is missing: the missing
values can be highlighted or flagged. For example, your records may not have many answers
to a specific survey question because customers did not want to answer it in the first place.
If the missing value is numeric, you can use 0, making sure to ignore these values during
statistical analysis. If the missing value is categorical, you can fill it with the label 'missing'.
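A small sketch, assuming pandas, of median imputation combined with flagging the missing
values (the column names and data are hypothetical):

import pandas as pd

df = pd.DataFrame({"age": [25, None, 40, 35],
                   "industry": ["Retail", None, "Banking", "Retail"]})

# Highlight/flag which values were missing before imputation
df["age_missing"] = df["age"].isna()

# Impute the numeric column with its median
df["age"] = df["age"].fillna(df["age"].median())

# Fill the categorical column with the label 'missing'
df["industry"] = df["industry"].fillna("missing")
print(df)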
1.11 SUMMARY
This chapter introduces the concept of data science as a discipline. The various issues faced
due to the growth in data are discussed. The importance of cleaning data and the anomalies
found in data are also discussed.
B.Sc.(DATA SCIENCE)
SEMESTER-I
STRUCTURE
2.0 Objectives
2.1 Introduction to Data Acquisition
2.2 Data Preprocessing and Techniques
2.2.1 Data Cleaning
2.2.2 Data Integration
2.2.3 Data Transformation
2.2.4 Data Reduction
2.2.4.1 Dimensionality Reduction
2.2.4.2 Numerosity Reduction
2.2.4.3 Data Compression
2.3 Data Mining
2.3.1 Data Mining Applications
2.4 Data Interpretation
2.5 Summary
2.6 Practice Question
2.0 OBJECTIVES
1. To introduce the concept of data acquisition.
3. To explain the concept of data pre-processing and familiarize the learner with the various
preprocessing techniques.
The acquired data is mainly divided into two types, known as:
i. Primary data
ii. Secondary data
i. Primary Data:
Data which is raw, original, and extracted directly from official sources is known as primary
data. This type of data is collected directly through techniques such as questionnaires,
interviews and surveys. The data collected must match the demands and requirements of the
problem being analysed, so as to obtain an accurate analysis. Some methods of collecting
primary data are:
a) Interview method
In this method, data is collected by interviewing the target audience; the person who conducts
the interview is called the interviewer and the person who answers is known as the
interviewee. The answers to the questions asked are noted down in the form of notes, audio
or video, and this data is stored for processing. The data can be both structured and
unstructured, as in personal or formal interviews conducted by telephone, face to face,
email, etc.
b) Survey method
The survey method is a research process in which a list of relevant questions is asked and the
answers are noted down in the form of text, audio or video. Surveys can be conducted in both
online and offline modes, for example through website forms and email, and the answers are
then stored for analysis. Examples are online surveys or surveys through social media polls.
c) Observation method
In this method, the researcher keenly observes the behaviour and practices of the target
audience using some data collection tool and stores the observed data in the form of text,
audio, video or other raw formats. Here the data is collected directly, by posing a few
questions to the participants; for example, observing a group of customers and their
behaviour towards certain products. The data obtained is then sent for processing.
d) Experimental method
The experimental method is the process of collecting data by performing experiments,
research and investigation. The most frequently used experimental designs are CRD, RBD,
LSD and FD.
• CRD - Completely Randomized Design is a simple experimental design used in
data analytics which is based on randomization and replication. It is mostly used
for comparing experiments.
• RBD - Randomized Block Design is an experimental design in which the
experiment is divided into small units called blocks. Random experiments are
performed on each of the blocks and results are drawn using a technique known as
analysis of variance (ANOVA). RBD originated in the agricultural sector.
• LSD - Latin Square Design is an experimental design that is similar to CRD and
RBD but contains rows and columns. It is an arrangement of an N x N square
with an equal number of rows and columns, containing letters that occur only
once in each row and column. Hence differences can be found easily, with fewer
errors in the experiment. A Sudoku puzzle is an example of a Latin square design.
• FD - Factorial Design is an experimental design in which each experiment has two
or more factors, each with possible values, and on performing trials the other
combinations of factors are derived.
ii. Secondary Data:
Secondary data is data which has already been collected and is reused again for some valid
purpose. This type of data is derived from previously recorded primary data and has two
types of sources, namely internal sources and external sources.
Internal sources
These types of data can easily be found within the organization, such as market records,
sales records, transactions, customer data, accounting resources, etc. The cost and time
consumed in obtaining data from internal sources are low.
External sources
Data which cannot be found within the organization and is obtained through external
third-party resources is external source data. The cost and time consumption are higher
because such sources contain a huge amount of data. Examples of external sources are
government publications, news publications, the Registrar General of India, the Planning
Commission, the International Labour Bureau, syndicate services, and other non-governmental
publications.
Other sources:
• Sensor data: With the advancement of IoT devices, the sensors of these devices
collect data which can be used for sensor data analytics to track the performance and
usage of products.
• Satellite data: Satellites collect terabytes of images and data on a daily basis through
surveillance cameras, which can be used to derive useful information.
The characteristics of big data, namely volume, velocity, variety and value, are very important
for the acquisition of data. In data science there is a variety of data that we need to deal with.
The data is majorly characterized as:
• Structured Data: Data which is available in some standard format is called
structured data. Structured data is organized data, stored in a row-and-column
structure as in relational databases and Excel sheets. Some examples of structured
data are names, numbers, geolocations, addresses, etc.
• Unstructured Data: This is unorganized data; there is no particular format for
unstructured data. It is available in a variety of forms like text, pictures, videos,
etc. Unstructured data is more difficult to search and requires processing to become
understandable.
The data acquired needs to be preprocessed to make it usable. The raw data may be
inconsistent, erroneous, incomplete and not in the required format, and these issues need to
be resolved to make the data usable. Data preprocessing is a collective term for the activities
involved in transforming real-world or raw data into a usable form, to make it more valuable
and to get it into the required format. The preprocessed data is cleaner and more valuable,
and hence is used as the final training set. Data preprocessing is a very essential step: the
cleaner and more consistent the data we get, the better the final output will be. In other
words, data preprocessing improves the quality of data. The huge size of data collected from
heterogeneous sources leads to anomalous data. Data preprocessing has become a vital and
fundamental step, considering the fact that high-quality data leads to better models and
predictions. Data preprocessing techniques are majorly categorized into the following methods:
2.2.1 Data Cleaning involves removing duplicate data, filling missing values, identifying
outliers, smoothing noise and removing data inconsistencies. The following steps must be
followed for cleaning data:
• Fix structural errors: Structural errors arise when you measure or transfer data and
notice strange naming conventions, typos or incorrect capitalization. These inconsistencies
can cause mislabelled categories or classes. For example, you may find that "N/A" and
"Not Applicable" both appear, but they should be analysed as the same category. Use a
single date format if there are multiple date formats in a single column.
• Filter unwanted outliers: Filtering unwanted outliers is essential for improving the
quality of the data. Outliers are data objects that stand out among the other objects in the
data set and do not conform to its normal behaviour. If an outlier is found to be irrelevant,
it must be handled. Some techniques for detecting outliers in huge datasets are boxplots,
Z-scores and the interquartile range. Outliers need to be handled according to the problem:
they can be removed, capped or imputed.
• Handle missing data: During the analysis of data, it is important to understand why
missing values exist - whether they are missing because they were not recorded, or because
the data genuinely does not exist. Missing data may be handled in the following ways (a
short pandas sketch follows this list):
i) In case the values were not recorded, it becomes essential to fill them in by
estimating them from the other values in that column and row; this is called imputation.
For example, if a value is missing for gender, it is understood that the value was not
recorded, so we need to analyse the data and give it a value; we cannot leave it blank in
this case. You can therefore impute missing values based on other observations, but there
is a chance of losing the integrity of the data because these values are filled in based on
assumptions and not actual observations.
ii) If a value is missing because it does not exist, e.g. the height of the oldest child of
someone who has no children, it would make more sense to either leave it empty or to add
a placeholder value like NA or NaN; where appropriate, such NA or NaN values can later
be replaced with 0. Pandas' fillna() function can be used to fill in missing values in a
DataFrame.
iii) In case it is not possible to figure out the reason for the missing value, that
particular record can be dropped. The pandas function dropna() can be used to do this, but
doing so can lose information, so be mindful of what is being removed before dropping it.
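A brief illustration of the fillna() and dropna() calls mentioned above, using a hypothetical
DataFrame:

import pandas as pd

df = pd.DataFrame({"height": [150.0, None, 165.0],
                   "city": ["Patiala", "Ludhiana", None]})

filled = df.fillna({"height": 0})   # replace missing numeric values with 0
dropped = df.dropna()               # drop rows with any missing value (information is lost)
print(filled)
print(dropped)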
• Noisy Data: Random variance in the data is called noise. It is generated by faulty
data collection or errors during data collection, and it adversely affects the analysis of the
data. The following methods, called smoothing techniques, are used to handle noisy data.
A common smoothing technique is binning (smoothing by bin means, medians or
boundaries), whose algorithm and a worked example are given below.
Algorithm:
1. Sort the data.
2. Partition the sorted data into equal-frequency bins of depth d.
3. Calculate the arithmetic mean or median of each bin, or identify its boundary (minimum
and maximum) values.
4. Replace each data element in each bin by the calculated mean/median/nearest boundary.
Worked example:
Step 1: Sort the data: 6, 6, 9, 12, 14, 16, 18, 18, 18, 20, 21, 22.
Step 2: Partition the data into equal-frequency bins; the number of bins is n/d, where n = the
number of elements and d = the bin depth. Here n/d = 12/3 = 4 bins:
Bin 1: 6, 6, 9
Bin 2: 12, 14, 16
Bin 3: 18, 18, 18
Bin 4: 20, 21, 22
Binning using Median: In this method, steps 1 and 2 are the same as above.
Steps 3 and 4: Replace each data element in each bin by the bin's median:
Bin 1: 6, 6, 6
Bin 2: 14,14,14
Bin 3: 18,18,18
Bin 4: 21, 21, 21
Binning using Boundary Values: In this method, each value in a bin is replaced by the nearer
of the bin's minimum and maximum (boundary) values, so the minimum and maximum of
each bin are retained (a short Python sketch reproducing these results is given after the bins
below):
Bin 1: 6, 6, 9
Bin 2: 12,12,16
Bin 3: 18,18,18
Bin 4: 20, 20, 22
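The median and boundary smoothing shown in the worked example can be reproduced with
a short plain-Python sketch; the data is the sorted list recovered from the bins above, and
ties between boundaries are resolved towards the lower boundary, as in the example:

data = [6, 6, 9, 12, 14, 16, 18, 18, 18, 20, 21, 22]   # already sorted
depth = 3
bins = [data[i:i + depth] for i in range(0, len(data), depth)]   # 12/3 = 4 equal-frequency bins

# Smoothing by bin medians: replace every value with the middle value of its bin
by_median = [[sorted(b)[len(b) // 2]] * len(b) for b in bins]

# Smoothing by bin boundaries: replace every value with the nearer of the bin's min/max
by_boundary = [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b] for b in bins]

print(by_median)    # [[6, 6, 6], [14, 14, 14], [18, 18, 18], [21, 21, 21]]
print(by_boundary)  # [[6, 6, 9], [12, 12, 16], [18, 18, 18], [20, 20, 22]]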
2.2.2 Data Integration
Data Integration is a vital step in which the data acquired from multiple sources is integrated
into a single centralized location. Data integration makes data comprehensive and more
usable.
[Figure: several data stores (Data Store 1-4) being combined into a single integrated data
store]
The common approaches to data integration are:
a) Data Consolidation
b) Data Propagation
c) Data Virtualization
d) Data Warehousing
a) Data Consolidation means consolidating data from several separate sources into one
data store, so that it is available to all stakeholders to enable better decision making. It
involves eliminating redundancies and removing inaccuracies before consolidating the data
into a single store. The most common data consolidation techniques are ETL (Extract,
Transform, Load), data virtualization and data warehousing.
• ETL is the most widely used data consolidation technique. The data is first extracted
from multiple sources, then transformed into an understandable format using various
operations like sorting, aggregating and cleaning, and finally loaded into a centralized store
such as another database or a data warehouse. The ETL process cleans, filters and
transforms data, and then applies business rules before the data populates the new source.
ETL is further of two types, namely real-time ETL, used in real-time systems, and batch
ETL, used for high-volume databases.
• Data Virtualization: In this method, the data stays in the original location, but
changes are made in a virtual manner and can be seen in a consolidated manner by
the users. It is a logical layer that amalgamates data from various sources without
performing actual ETL process. It is an abstraction such that only the required data is
visible to the users without requiring technical details about the location or structure
of the data source. It provides enhanced data security.
• Data Warehousing is the integration of data from multiple sources to a centralized
source to facilitate decision making, reporting and query handling. A centralized
source of data enables better decision making. Data warehousing is the secure
electronic storage of information by a business or other organization. The goal of
data warehousing is to create a trove of historical data that can be retrieved and
analyzed to provide useful insight into the organization's operations.
b) Data Propagation involves copying data from one location i.e., source to another
location i.e., target location. It is event driven. These applications usually operate online
and push data to the target location. They are majorly useful for real time data
movement such as workload balancing, backup and recovery. Data propagation can be
done asynchronously or synchronously.
The two methods for data propagation are Enterprise Application Integration (EAI) and
Enterprise Data Replication (EDR). The key advantage of data propagation is that it can be
used for real-time or near-real-time data movement, as well as for workload balancing,
backup and recovery. EAI is used mainly for the exchange of messages and transactions in
real-time business transaction processing, whereas EDR is used for the transfer of voluminous
data between databases.
c) Data Virtualization: In this approach, data is not stored in a single location; instead it
is abstracted so that it can be viewed as a unified view of data from multiple sources. Data
federation is a form of data virtualization, supported by Enterprise Information Integration
(EII). EII products have evolved from two different technological backgrounds - relational
DBMS and XML - but the current trend in the industry is to support both approaches, via
SQL (ODBC and JDBC) and XML (XML Query Language - XQuery - and XML Path
Language - XPath) data interfaces.
2.2.3 Data Transformation
After the data has been acquired, it is cleaned as discussed above. The clean data may not be
in a standard format. The process of changing the structure and format of data to make it
more usable is called data transformation. Data transformation may be constructive,
destructive, aesthetic or structural.
Scripting languages like Python or domain-specific languages like SQL are usually used for
data transformation.
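As a small illustration of such a transformation in Python, the sketch below applies min-max
normalization, rescaling a numeric column to the 0-1 range (the column name and values are
hypothetical):

import pandas as pd

df = pd.DataFrame({"salary": [20000, 35000, 50000, 80000]})

# Min-max normalization: (x - min) / (max - min) maps every value into [0, 1]
df["salary_scaled"] = (df["salary"] - df["salary"].min()) / (df["salary"].max() - df["salary"].min())
print(df)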
• Principal Component Analysis (PCA) is applied to skewed and sparse data. In this
method, the entire data set is represented by a few independent tuples with 'n' attributes.
This method was introduced by Karl Pearson. It works on the condition that, when data in a
higher-dimensional space is mapped to a lower-dimensional space, the variance of the data
in the lower-dimensional space should be maximized. It involves the following steps (a short
sketch using scikit-learn follows):
i. Construct the covariance matrix of the data.
ii. Compute the eigenvectors of this matrix.
iii. Use the eigenvectors corresponding to the largest eigenvalues to reconstruct a large
fraction of the variance of the original data.
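A compact sketch of these steps using scikit-learn's PCA, which performs the covariance and
eigenvector computation internally; the random data and the choice of two components are
illustrative assumptions:

import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(100, 5)            # 100 tuples with 5 attributes

pca = PCA(n_components=2)             # keep the directions of maximum variance
X_reduced = pca.fit_transform(X)      # map the data to the lower-dimensional space

print(pca.explained_variance_ratio_)  # fraction of variance retained by each component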
2.2.4.2 Numerosity Reduction is the reduction of the original data and its representation in a
smaller form. It can be done in two ways: parametric and non-parametric numerosity
reduction.
i) Parametric Numerosity Reduction: In this method, only the data parameters are
stored, instead of the entire original data. The data is represented using some model
to estimate the data, so that only parameters of data are required to be stored,
instead of actual data. Regression and Log-Linear methods are used for creating
such models.
ii) Non-Parametric Numerosity Reduction methods store reduced representations of the
data; they include histograms, clustering, sampling and data cube aggregation.
• Histogram is the data representation in terms of frequency. It uses binning to
approximate data distribution and is a popular form of data reduction.
• Clustering divides the data into groups or clusters. This technique partitions the
whole data set into different clusters, and in data reduction the cluster representation of
the data is used to replace the actual data. It also helps to detect outliers in the data.
• Sampling can be used for data reduction because it allows a large data set to be
represented by a much smaller random data sample or subset (a short pandas sketch is
given after this list).
• Data Cube Aggregation involves moving the data from detailed level to a fewer
number of dimensions. The resulting data set is smaller in volume, without loss of
information necessary for the analysis task.
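Random sampling as a numerosity-reduction step can be done directly in pandas; a minimal
sketch (the 10% fraction is an illustrative choice):

import pandas as pd

df = pd.DataFrame({"value": range(10000)})

sample = df.sample(frac=0.1, random_state=42)   # a much smaller random subset represents the data
print(len(sample))                              # 1000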
2.2.4.3 Data Compression is a technique in which the original data is compressed for
reduction. The compressed data can later be reconstructed to form the original data. If the
data can be reconstructed without losing any information, the reduction is 'lossless';
otherwise it is 'lossy'.
In recent data mining projects, various major data mining techniques have been developed
and used, including association, classification, clustering, prediction, sequential patterns, and
regression.
• Association is a data mining technique used to discover a link between two or more
items. It uses if-then statements to show the probability of interactions between data items
within large data sets, and it finds hidden patterns in the data. It is commonly used to find
sales correlations in transactional data or patterns in medical data sets.
• Clustering is a data mining technique in which clusters of objects that share the same
characteristics are created. Clusters relate to hidden patterns. It is based on unsupervised
learning (a short clustering sketch is given after this list of techniques).
• Classification classifies items or variables in a data set into predefined groups or
classes. It is based on supervised learning.
• Prediction is used in predicting the relationship that exists between independent and
dependent variables as well as independent variables alone. It can be used to predict
future trends. It analyses past events or instances in the right sequence to predict a
future event.
• Sequential patterns is a data mining technique specialized for evaluating sequential
data in order to discover sequential patterns. It consists of finding interesting subsequences
in a set of sequences, where the interestingness of a subsequence can be measured in terms
of criteria such as length, occurrence frequency, etc.
• Regression: Regression analysis is a data mining process used to identify and
analyze the relationship between variables in the presence of other factors. It is used to
estimate the likelihood of a specific variable, and it is primarily a form of planning and
modeling. For example, we might use it to project certain costs depending on other factors
such as availability, consumer demand and competition. Primarily, it gives the exact
relationship between two or more variables in the given data set.
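As a small illustration of the clustering technique mentioned above, the sketch below groups
points with scikit-learn's KMeans; the library choice, the toy data and the use of two clusters
are assumptions made for the example:

import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)            # cluster assigned to each point
print(kmeans.cluster_centers_)   # centre of each discovered cluster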
2.5 SUMMARY
This chapter introduces the various facets of data and the various data preprocessing
techniques. A brief introduction to data mining and data interpretation is also provided in this
chapter.
2.6 PRACTICE QUESTIONS
B.Sc.(DATA SCIENCE)
SEMESTER-I
STRUCTURE
3.0 Objectives
3.1 Big Data Analytics
3.2 Data Science – working
3.3 Facets of Data
3.3.1 Natural Language Data
3.3.2 Machine Generated Data
3.3.3 Streaming Data
3.3.4 Acoustic, Video and Image Data
3.3.5 Sensor Data
3.3.6 Graph/ Network Data
3.4 Multiple Problems in Handling Large Datasets
3.5 General Techniques for Handling Large Data
3.5.1 Choosing the Right Algorithm
3.5.2 Right Data Structure
3.5.3 Choosing the Right Tools
3.6 Distributing Data Storage and Processing with Frameworks
3.6.1 Hadoop
3.6.2 Apache Spark
3.6.3 Apache Storm
3.6.4 Samza
3.6.5 Flink
3.7 Summary
3.8 Practice Questions
3.0 OBJECTIVES
1. To familiarize the learner with data representation and the types of data
2. To familiarize the learner with the general techniques for handling large data
What do we do with all of this data? How do we make it useful to us? What are its real-world
applications? These questions are the domain of data science.
• Structured
• Unstructured
• Natural language
• Machine-generated
• Streaming
• Acoustic data, video, and images
• Sensor Data
• Graph-based/ Network data
As in the previous chapter, we have already explored the categories of structured versus
unstructured data. Let us now talk about the remaining representations of data and explore
their characteristics. Structured data is the main type of data explored in a model and can be
expressed as records with fields of fixed length and dimension; RDBMS and Excel data are
usually structured data with known or user-defined datatypes and structural information. In
contrast, unstructured data is data that may not fit any kind of mathematical or simulation
model, due to the context-specific or varying nature of the stored raw data. One example of
unstructured data is email.
Natural language data, such as free text, is challenging to process because it is hard to
decipher the meaning of every piece of text. The meaning of the same words can vary when
coming from someone upset or joyous. This is illustrated in figure 3.1.
Examples of streaming data are "What's trending" on Twitter, live sporting or music events,
and the stock market.
3.3.4 Acoustic, video and image data
The world of animation and digitization has produced multimedia datasets containing audio,
video and images. These types of datasets can be stored, handled and interpreted with
object-oriented databases, which include class inheritance and interfaces in a prespecified
format for handling the complexity of the data; they find application in digital libraries,
video-on-demand, news-on-demand, musical databases, etc. Other types of information in
multimedia datasets can be represented and reproduced as sensory experiences such as touch,
sight and hearing, which typically lead to huge amounts of digitized data being stored.
Furthermore, data compression for images and sounds can exploit the limits of human senses
by throwing away information that is not needed for a good-quality experience. Certain
characteristics of, and limitations in, dealing with such large amounts of data are described
below:
Vector graphics encode image sequences as a series of lines or curves. The process is
expensive in terms of image computation, but the image rescales smoothly.
Bitmaps encode the image as an array of pixels. The encoding process is cost-effective in
terms of computation, but the image scales poorly, leading to loss of image data.
The basic idea of image data is an array of receptors, where each receptor records a pixel by
"counting" the number of photons that strike it during exposure. Red, green and blue are
recorded separately at each point of the image by a group of three receptors, each of which
sits behind a colour filter.
Acoustic data is compressed using techniques which are either lossy or lossless. Lossy
compression of acoustic data may lead to the loss of certain information, as the data is
non-repetitive; MPEG audio formats (like JPEG for images) are a family of
perceptually-based, lossy techniques. In lossless techniques such as WMA Lossless, ALAC
and MPEG-4 ALS, the information in the acoustic data is preserved.
The encoding of data in the form of recorded sound or captured visuals is highly challenging
for data scientists to process. Some tasks easily performed by the human brain, such as
recognizing objects in images, have proved very difficult for computers. MLBAM (Major
League Baseball Advanced Media) announced in 2014 that they would increase video capture
to approximately 7 TB per game for the purpose of live, in-game analytics. Higher-resolution
cameras and acquisition sensors have the capability to capture the motion of the ball and the
players in real time, e.g. the path taken by a defender relative to two baselines.
The exponential growth of IoT-based systems has led to an ever-increasing number of input
sensor devices, which are responsible for the collection, storage and interpretation of the
captured data. In addition, consumers, organizations, governments and companies themselves
produce more and more data, for example on social media. The amount of data is growing
exponentially; people speak of Big Data when they work with one or more datasets that are
too large to be maintained with regular database management systems.
The advantage of sensor data is that the decisions of IoT devices are based on information
derived from evidence rather than on irrelevant details or subjective experience. This kind of
knowledge base makes a system cost-effective, with streamlined processes, improved product
quality and better services. By combining data intelligently and by interpreting and
translating it, new insights are
created that can be used for new services, applications and markets. This information can also
be combined with data from various external sources, such as weather data or demographics.
3.3.6 Graph/ Network Data
"Graph data" is data which can be represented as a graph, with the special characteristic
that mathematical graph theory can be applied to the mined information. A "graph" in
network data generally represents a statistical and mathematical model of pairwise
relationships between the constituent objects.
short, data that focuses on the relationship or adjacency of objects. The representation of the
graph models includes the nodes, edges, and relationships between the stored data in the
nodes. Graph-based data is a natural way to represent social networks, and its structure
allows you to calculate specific metrics such as the influence of a person and the shortest path
between two people.
Examples of graph-based data can be found on many social media websites (figure 3.3). For
instance, on LinkedIn you can see who you know at which company. Your follower list on
Twitter is another example of graph-based data. Graph databases are used to store
graph-based data and are queried with specialized query languages such as SPARQL.
Some common big data problems and the respective solutions have been discussed below.
1. Lack of Understanding
A lack of understanding of the mined data in data science-based companies can hold back
performance in many areas. Major goals for data-driven companies include reducing the
expense of mining information, finding new ways to interpret data, launching new products
and enhancing performance. Despite these benefits, companies have been slow to adopt a
data-centric approach.
Solution: Follow a top-down approach to introducing data science into the organization's
procedures. If no data science professional is available, a consultancy service or an IT
professional with data science knowledge should be hired to get a better understanding.
2. High Costs
Companies have realised that buying and maintaining the necessary components makes the
system less cost-effective. In addition to the cost of servers and software-based storage, the
high cost of data science experts adds to the expense and makes the process time-consuming.
Solution: The solution is to understand the need for and use of the data, and to work
collaboratively to define a goal, conduct the research and implement the solution according
to a plan.
3. Too Many Choices
In what Schwartz coined the "paradox of choice", option overload can cause inaction on the
part of a buyer. In the world of data and data tools, the options are almost as widespread as
the data itself, so it is understandably overwhelming to decide which solution is right for the
business, especially when it will likely affect all departments and hopefully be a long-term
strategy.
Solution: As with understanding data, a good solution is to leverage the experience of an
in-house expert, perhaps a CTO. If that is not an option, hire a consultancy firm to assist in
the decision-making process. Use the internet and forums to source valuable information and
ask questions.
4. Complex Systems
Understanding data management systems and finding a relevant solution for manipulating
the data is in itself a problem. The vast expanse of different types of data, with IT teams
creating their own data during the process of data handling, results in increased complexity.
Solution: Find a solution with a single command center, implement automation wherever
possible, and ensure that the system can be accessed remotely 24/7.
5. Security Gaps
Another important aspect of data science is the security of the data, since breaches or misuse
of large amounts of data are always possible. To handle this, encryption and decryption must
be applied and the data must be stored properly.
Solution: The data needs to be handled with automated security updates for the data
warehouse and automated backups.
6. Low Quality and Inaccurate Data
Having data is only useful when it is accurate. Low-quality data not only serves no purpose,
but it also uses unnecessary storage and can harm the ability to gather insights from clean
data. Examples of low-quality data include:
• Inconsistent formatting (which takes time to correct and can happen when the same
element is spelled differently, like "US" versus "U.S."),
• Missing data (e.g. a first name or email address is missing from a database of contacts),
• Inaccurate data (e.g. it is simply not the right information, or the data has not been
updated),
• Duplicate data (e.g. the data is being double counted).
If data is not maintained or recorded properly, it is just like not having the data in the first
place.
Solution: Begin by defining the necessary data you want to collect (again, align the
information needed to the business goal). Cleanse data regularly and when it is collected from
different sources, organize and normalize it before uploading it into any tool for analysis.
Once you have your data uniform and cleansed, you can segment it for better analysis.
7. Compliance Hurdles
When collecting information, security and government regulations come into play.
With the somewhat recent introduction of the General Data Protection Regulation (GDPR),
it‘s even more important to understand the necessary requirements for data collection and
protection, as well as the implications of failing to adhere. Companies have to be compliant
and careful in how they use data to segment customers, for example when deciding which
customers to prioritize or focus on. This means that the data must be a representative sample
of consumers, the algorithms must prioritize fairness, there must be an understanding of the
inherent bias in the data, and Big Data outcomes have to be checked against traditionally
applied statistical practices.
Solution: The only solution for adhering to compliance and regulation is to be informed and
well-educated on the topic. There is no way around it other than learning, because in this
case ignorance is most certainly not bliss: it carries both financial and reputational risk to
your business. If you are unsure of any regulations or compliance requirements, you should
consult expert legal and accounting firms specializing in those rules.
There is no one-to-one relationship between the problems discussed in the section above and
the general techniques for handling large data discussed below; one technique may help with
several of the problems.
Fig. 3.5: The algorithms for handling Big Data [12]
The selection of an effective algorithm for processing the data can provide better results than
enhanced hardware. To perform predictive or selective analysis, a well-chosen algorithm need
not load the complete data set into memory (RAM); rather, it supports parallel computation
with parallel or distributed databases. In this section three types of algorithms are discussed
that can perform computations in parallel, reducing the computation or memory overhead:
online algorithms, block algorithms, and MapReduce algorithms, as shown in figure 3.5.
Several, but not all, machine learning algorithms can be trained using one observation at a
time instead of loading all the data into memory. The model is updated based on the current
parameters, and the previous observations can then be forgotten by the algorithm. This
"use and forget" technique makes the system highly memory-efficient: as new data values are
acquired by the algorithm, the previous values are forgotten or overwritten, but their effect
remains visible in the performance metrics of the model. Most online algorithms can also
handle mini-batches; this way, the data science expert or developer can feed batches of 10 to
1,000 observations at a single instance and then apply a sliding window to access the data.
This learning can be handled in several ways, discussed as follows (a short mini-batch sketch
is given after the list):
• Full batch learning (also called statistical learning) —Feed the algorithm all
the data at once.
• Mini-batch learning —Feed the algorithm a spoonful (100, 1000, ...,
depending on what your hardware can handle) of observations at a time.
• Online learning —Feed the algorithm one observation at a time.
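The original material does not include code at this point; the following is a minimal illustrative sketch of mini-batch (online) learning, assuming scikit-learn is available and using synthetic data (the data, model and batch size are assumptions, not from the original text).
import numpy as np
from sklearn.linear_model import SGDClassifier

# Synthetic data: 10,000 observations, 20 features, binary labels
rng = np.random.RandomState(0)
X = rng.randn(10000, 20)
y = (X[:, 0] + X[:, 1] > 0).astype(int)

model = SGDClassifier()      # supports incremental (online) learning
classes = np.array([0, 1])   # all class labels must be declared on the first call

# Mini-batch learning: feed 1,000 observations at a time
for start in range(0, len(X), 1000):
    model.partial_fit(X[start:start + 1000], y[start:start + 1000], classes=classes)

print(model.score(X, y))
The same loop with a batch size of 1 corresponds to pure online learning, while calling fit() on the full arrays corresponds to full batch learning.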
The algorithms discussed above can enhance the performance and execution of data manipulation in the warehouse. This process in the data science field leads to fragmentation of the raw data, which is why the structures used for storing the data are equally important to a data scientist or data science researcher. Data structures have different storage requirements, and they also influence the performance of CRUD (create, read, update, and delete) and other operations on the data set.
Sparse Data
Most IT professionals and data scientists understand the representation through a sparse matrix. Similarly, sparse data contains relatively little information compared to the total number of entries. As per figure 3.7, the conversion of data from text to binary, or an image to binary, can be expressed as sparse data. Imagine a set of 100,000 completely unrelated Twitter tweets. Most of them probably have fewer than 30 words, but together they might contain hundreds or thousands of distinct words. For this reason, text documents are processed for stop words, cut into fragments, and stored as vectors. The basic idea is that any word present in a tweet is expressed as 1 and any word not present as 0, resulting in sparse data indeed. A dense matrix built this way would require as much memory as any other matrix of the same size even though it holds little information, which is why dedicated sparse storage formats are used.
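As an illustration (not part of the original text), the following hedged sketch uses scikit-learn's CountVectorizer to turn a few short texts into a sparse 0/1 matrix instead of a dense one; the example texts are invented.
from sklearn.feature_extraction.text import CountVectorizer

tweets = [
    "data science is fun",
    "big data needs sparse storage",
    "python makes data wrangling easy",
]
vectorizer = CountVectorizer(binary=True)   # 1 if the word occurs in the tweet, 0 otherwise
matrix = vectorizer.fit_transform(tweets)   # returned as a SciPy sparse matrix

print(matrix.shape)       # (3, number of distinct words)
print(matrix.toarray())   # dense view; mostly zeros for large collections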
Tree Structure
Trees are a special type of data structure that offer faster retrieval of information in comparison to a table or sparse data. In these data structures the root node is the first directory for accessing the information, and the child nodes are the subdirectories of the root node. The information can be accessed from the child or leaf nodes by using either pointers or indexing of the tree structure. Figure 3.8 helps the researcher understand the process of information retrieval in a tree structure.
Hash Tables
The calculation of a key for each data entry and the allotment of that entry to the relevant bucket is the core process of the hash table structure for the storage of data. This way of storing and handling data makes the hash table a reliable source for the retrieval of information based on a key value: you can quickly retrieve the information by looking in the right bucket. Dictionaries in Python are a hash table implementation, and they are a close relative of key-value stores. You will encounter them in the last example of this chapter when you build a recommender system within a database. Hash tables are used extensively in databases as indices for fast information retrieval.
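A small illustrative sketch (not from the original text) of key-based lookup with a Python dictionary, which is backed by a hash table; the keys and values are invented:
# Keys are hashed and placed into buckets internally by the dict
ratings = {"user1": 4.5, "user2": 3.0, "user3": 5.0}

ratings["user4"] = 4.0        # insert: hash the key, store the value in its bucket
print(ratings.get("user2"))   # retrieval by key is an (amortised) constant-time lookup
print("user9" in ratings)     # membership testing is also a hash lookup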
As discussed earlier in sections 3.4.1 and 3.4.2, the need for the right algorithm goes hand in hand with the need for the right data structures and tools.
Python has a number of libraries that can help you deal with large data. They range from smarter data structures and code optimizers to just-in-time compilers. The following is a list of libraries commonly used when confronted with large data:
Cython —The closer you get to the actual hardware of a computer, the more vital it is for the computer to know what types of data it has to process. For a computer, adding 1 + 1 is different from adding 1.00 + 1.00: the first example consists of integers and the second consists of floats, and these calculations are performed by different parts of the CPU. Python does not require the types of the data to be specified; the interpreter infers them while running the code, which is slow. Cython, a superset of Python, forces the user to define the data types before execution, so the code can be compiled and run much faster. See http://cython.org/ for more information on Cython.
Numexpr —Numexpr is at the core of many of the big data packages, as is NumPy for
in-memory packages. Numexpr is a numerical expression evaluator for NumPy but can be
many times faster than the original NumPy. https://github.com/pydata/numexpr.
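A minimal hedged sketch of Numexpr, assuming the numexpr package is installed (the arrays and expression are illustrative):
import numpy as np
import numexpr as ne

a = np.arange(1000000)
b = np.arange(1000000)

# The whole expression is evaluated in one optimized, multi-threaded pass
result = ne.evaluate("2*a + 3*b")
print(result[:5])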
Numba —Numba is a package that compiles (just-in-time) the code for data manipulation and handling right before executing it, which makes it run much faster. The user gets a platform to develop high-level Python code that executes at a speed close to compiled C. http://numba.pydata.org/.
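A minimal hedged sketch of Numba's just-in-time compilation, assuming the numba package is installed (the function and data are illustrative):
import numpy as np
from numba import jit

@jit(nopython=True)           # compiled to machine code on the first call
def array_sum(arr):
    total = 0.0
    for value in arr:
        total += value
    return total

data = np.random.rand(1000000)
print(array_sum(data))        # later calls reuse the compiled version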
Bcolz —Bcolz helps you overcome the out-of-memory problem that can occur when using NumPy. It can store and work with arrays in an optimal compressed form. It not only slims down your storage needs but also uses Numexpr in the background to reduce the work needed when performing calculations with bcolz arrays. http://bcolz.blosc.org/.
Blaze —Blaze is ideal when the data scientist wants the power of a database backend but prefers the "Pythonic way" of working with data. Blaze translates your Python code into SQL, but it can handle many more data stores than relational databases, such as CSV, Spark, and others. Blaze delivers a unified way of working with many databases and data libraries. http://blaze.readthedocs.org/en/latest/index.html.
Theano —Theano enables you to work directly with the graphics processing unit (GPU) and to perform symbolic simplifications whenever possible, and it comes with an excellent just-in-time compiler. On top of that, it is a great library for dealing with an advanced but useful mathematical concept: tensors. http://deeplearning.net/software/theano.
Large amounts of data require more robust computer software for processing, and this is best handled by data processing frameworks. The following are among the most widely used data processing frameworks, suitable for meeting a variety of different business needs.
3.6.1 Hadoop
This is an open-source batch processing framework that can be used for the
distributed storage and processing of big data sets. Hadoop relies on computer clusters
and modules that have been designed with the assumption that hardware will inevitably
fail, and those failures should be automatically handled by the framework.
There are four main modules within Hadoop. Hadoop Common is where the libraries
and utilities needed by other Hadoop modules reside. The Hadoop Distributed File
System (HDFS) is the distributed file system that stores the data. Hadoop YARN (Yet
Another Resource Negotiator) is the resource management platform that manages the
computing resources in clusters, and handles the scheduling of users‘ applications. The
Hadoop MapReduce involves the implementation of the MapReduce programming model
for large-scale data processing.
Hadoop operates by splitting files into large blocks of data and then distributing those
datasets across the nodes in a cluster. It then transfers code into the nodes, for processing
data in parallel. The idea of data locality, meaning that tasks are performed on the node
that stores the data, allows the datasets to be processed more efficiently and more quickly.
Hadoop can be used within a traditional onsite datacenter, as well as through the cloud.
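The MapReduce programming model itself is easy to illustrate. The sketch below (plain Python, not tied to an actual Hadoop cluster; the documents are invented) mimics the map and reduce phases of a word count.
from collections import defaultdict

documents = ["big data needs frameworks", "hadoop splits big data into blocks"]

# Map phase: emit a (word, 1) pair for every word in every document
mapped = []
for doc in documents:
    for word in doc.split():
        mapped.append((word, 1))

# Shuffle/reduce phase: group the pairs by key and sum the counts
counts = defaultdict(int)
for word, count in mapped:
    counts[word] += count

print(dict(counts))   # e.g. {'big': 2, 'data': 2, 'needs': 1, ...}
In Hadoop the map and reduce functions run in parallel on the nodes that hold the data blocks.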
Apache Spark builds several higher-level components on top of its core engine:
• Spark SQL, which provides a domain-specific language used to manipulate DataFrames.
• Spark Streaming, which processes data in mini-batches as RDD transformations, allowing the same application code created for batch analytics to also be used for streaming analytics.
• Spark MLlib, a machine-learning library that makes large-scale machine learning pipelines simpler.
• GraphX, the distributed graph processing framework on top of Apache Spark.
A minimal PySpark sketch follows the list.
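The sketch below is illustrative only: it assumes the pyspark package is installed and that a file named sales.csv with a region column exists (both the file and the column name are assumptions, not from the original text).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("example").getOrCreate()

# Read a CSV file into a distributed DataFrame and run a simple aggregation
df = spark.read.csv("sales.csv", header=True, inferSchema=True)
df.groupBy("region").count().show()

spark.stop()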
3.6.4 Samza
Samza is another open-source framework that offers a near-real-time, asynchronous framework for distributed stream processing. More specifically, Samza handles immutable streams, meaning transformations create new streams that are consumed by other components without any effect on the initial stream. This framework works in conjunction with other frameworks, using Apache Kafka for messaging and Hadoop YARN for fault tolerance, security, and resource management.
3.6.5 Flink
Flink is an open-source hybrid framework: it is a stream processor but can also manage batch tasks. It uses a high-throughput, low-latency streaming engine written in Java and Scala, and its pipelined runtime system allows the execution of both batch and stream processing programs. The runtime also supports the execution of iterative algorithms natively. Flink applications are all fault-tolerant and can support exactly-once semantics. Programs can be written in Java, Scala, Python, and SQL, and Flink offers support for event-time processing and state management.
3.7 SUMMARY
Data science involves using methods to analyze massive amounts of data and extract
the knowledge it contains. You can think of the relationship between big data and data
science as being like the relationship between crude oil and an oil refinery. Data science
and big data evolved from statistics and traditional data management but are now
considered to be distinct disciplines.
The foremost requirement for a data scientist refining raw data is a representation of the data that copes with the problems of big data. Another aspect of dealing with the problems of big data is to perform the processing with frameworks. For this, multiple special types of data have been explored.
There are multiple problems being faced by the data scientists for the processing of
the raw data and mining it into useful and relevant information. Luckily, there are
pragmatic solutions that companies can take to overcome their data problems and thrive
in the data-driven economy. The problems and the relevant solutions have been discussed
in the chapter. Data processing frameworks are not intended to be one-size-fits-all
solutions for businesses. Hadoop was originally designed for massive scalability, while
Spark is better with machine learning and stream processing. A good IT services
consultant can evaluate your needs and offer advice. What works for one business may
not work for another, and to get the best possible results, you may find that it‘s a good
idea to use different frameworks for different parts of your data processing.
B.Sc.(DATA SCIENCE)
SEMESTER-I
STRUCTURE
4.0 Objectives
4.1 Data Science Ethics
4.2 Data Science – Good Aspects for Technology
4.3 Owners of Data
4.3.1 Responsibilities of the Data Owner
4.3.2 The Importance of Assigning Data Owners
4.3.3 Identification of Data Owners: Three Questions to Ask
4.4 Different Aspects of Privacy
4.5 Five C's of Data Science
4.5.1 Consent
4.5.2 Clarity
4.5.3 Consistency and trust
4.5.4 Control and transparency
4.5.5 Consequences
4.6 Diversity – Inclusion
4.7 Future Trends
4.8 Summary
4.9 Practice Questions
4.1 OBJECTIVE
1. To familiarize with the Data Science Ethics
2. To familiarize with different aspects of privacy
3. To familiarize with the future trends of data Science
4.2 DATA SCIENCE ETHICS
The skill of extracting or mining relevant patterns, and the capability of transforming and revolutionizing products, allow data science to bring positive change in the social and technological spheres, yet the technology itself is ethically neutral. It does not come with its own perspective of what is correct or incorrect, or of what is good or bad in using it. Data science has no built-in value framework, whereas the companies working on data stores do have value-based systems for the handling of information. Anything which is private or protected should not be accessible to anyone except the administrator or the data scientist. The problems of ethical mishandling of data, and the relevant solutions to those problems, must be integrated with the ethics of the companies working in Big Data.
The future of the technology world is in the hands of machine learning techniques and AI-based systems, and data science supplies the data on which these systems are trained. Data science is a kind of fuel for working intelligent systems, since their training can be done on millions of raw data points, all of which can be mined or extracted from data science sources through the extractive approach of data scientists. Data ethics is a rapidly evolving field of study. IT professionals and data scientists collect, share and manipulate data while keeping ethics in mind during the exploration of data, and are sometimes even forced to deal with ethics to avoid negative public opinion. A loss of ethics can prove damaging to the reputation and work culture of any kind of organization.
To acclimatize an organization to ethical data science, there is a need to understand what the ethics of data science actually are, what it costs to implement ethical values, and what the ways to execute such practices can be. There are several such ways and means, with their respective solutions, discussed below.
Firstly, the data scientist needs to understand the sensitivity of the information in the data store. With this, the relevant operations can be performed on the data while keeping in view the consequences of improper or unacceptable access to the data. Indeed, the security perspective in computing and networking has proved many times that ignoring the consequences of accessing data unethically can cause trouble for the data professionals or the company as a whole, in terms of loss of reputation or money.
With the available systems it is difficult for any data scientist to predict the inevitable unethical uses of the data, but with AI-based techniques, especially machine learning and deep learning, a prediction can be made about unintended consequences. For example, Facebook's "Year in Review" reminded people of deaths and other painful events. This can be handled by data professionals by keeping in view the patterns of the data to be explored and by thinking of better ways to represent the data with a new or enhanced approach.
Another important step is to stop the data mining if the administrator finds any kind of problem in the production line. This idea goes back to Toyota's Kanban: any assembly line worker can stop the line if they see something going wrong. The line doesn't restart until the problem is fixed. Workers don't have to fear consequences from management for stopping the line; they are trusted, and expected to behave responsibly.
The issue lurking behind all of these concerns is, of course, corporate culture. Corporate
environments can be hostile to anything other than short-term profitability. That‘s a
consequence of poor court decisions and economic doctrine, particularly in the U.S. But that
inevitably leads us to the biggest issue: how to move the needle on corporate culture. Susan
Etlinger has suggested that, in a time when public distrust and disenchantment is running
high, ethics is a good investment. Upper-level management is only starting to see this;
changes to corporate culture won‘t happen quickly. Users want to engage with companies and
organizations they can trust not to take unfair advantage of them. Users want to deal with
companies that will treat them and their data responsibly, not just as potential profit or
engagement to be maximized. Those companies will be the ones that create space for ethics
within their organizations. We, the data scientists, data engineers, AI and ML developers, and
other data professionals, have to demand change. We can't leave it to people that "do" ethics. We can't expect management to hire trained ethicists and assign them to our teams. We need to live ethical values, not just talk about them. We need to think carefully about the consequences of our work. We must create space for ethics within our organizations. Cultural change may take time, but it will happen—if we are that change. That's what it means to do good data science.
'Owning' data – who owns data, what is capable of being owned, and what rights and responsibilities ownership attracts – is gaining a lot of attention.
Data classification is one of the most important steps. Data classification means different things to different organizations, but at the basic level it is knowing the type of data a company has, determining its value, and categorizing it. For example, if your company has a secret sauce or original intellectual property, it may be considered "top secret" or "confidential". The reason to label or classify it as "top secret" or "confidential" is so that it can be handled and ultimately protected appropriately. In addition to not putting the correct controls on data, there is the potential to retain data longer than needed, or to destroy data before it should be destroyed based on laws or contractual commitments. Many people have an opinion on how data should be classified or labeled, but at the end of the day it is the responsibility of the data owner, who is ultimately accountable for the data, to make the final decision. The data owner will have the most knowledge of the use of the data and its value to the company. It is advisable for the data owner to get input from various sources, like the data custodian or data users, but the data owner has complete control over the data.
• Who has access to the data? Clarify the roles of people who can access the data.
Example: Employees can see an organization chart with departments, manager
names, and titles but not salary information (Classification = internal). But a very
limited audience like HR should only have access to salary data, performance
data, or social security numbers (Classification = confidential).
• How is the data secured? Sensitive data elements within HR documentation have
been classified to be confidential and therefore it requires additional security
controls to protect it. Some of the additional controls to secure confidential data
stored in electronic medium could include being saved in a location on the
network with appropriate safeguards to prevent unauthorized access (secure
folders protected by passwords).
• How long is the data retained? Many industries require that data be retained for a certain length of time. For example, the finance industry requires a seven-year retention period and some health care industries require a 100-year retention period. Data owners need to know the regulatory requirements for their data, and if there is no clear guidance on retention it should be based on the company's retention policy or best practices.
• How should the data be destroyed? Based on the classification of the data there should be clear guidance on how to dispose of or destroy the data. For example:
Hard Copy:
• Public: Place in recycling bins or trash receptacles.
• Internal: Place in secured shredding bins or manually shred.
• Confidential: Place in secured shredding bins, manually shred with a cross-cut shredder, or pulp. A record must be maintained that indicates the records disposed of and the date of disposal.
Electronic:
• Public: Electronic records can be deleted normally.
• Internal: Electronic records need to be overwritten to 1's and 0's or erased with a secure delete option.
• Confidential: Electronic records need to be degaussed off magnetic media after being securely deleted, or the physical media should be destroyed, with a record maintained that indicates the records disposed of and the date of disposal.
• What data needs to be encrypted? Data owners should decide whether their data needs to be encrypted. To make this determination the data owner should know the applicable laws or regulatory requirements that must be complied with. A good example of a regulatory requirement is the Payment Card Industry (PCI) Data Security Standard, which requires that the transmission of cardholder data across open, public networks must be encrypted.
2. Defining policies — As they have a vested interest in the integrity of their data, owners
focus on defining policies (for example retention or deletion policies) and standards that
ensure the alignment of their data to the GDPR.
3. Creating trusted data — Data ownership is a key ingredient to gain customer trust
and achieve measurable business benefits. Poor data could easily result in bad customer
experiences and ultimately losing customers. In particular, when personal data is not
reconciled into a data subject 360° view, compliance with data subject access rights, such as
rights of portability or rights to be forgotten cannot be fully achieved.
This issue stems from multiple teams addressing the same problem either because there isn‘t
a clear understanding of data or they‘re not even aware that the problem has been resolved by
another team. Federated ownership eliminates these painful issues.
Asking the right questions helps. Once these questions have been answered, the data owner
should become clearer:
Proprietary Data – specifically the data that makes up the intellectual capital of individual businesses and gives them their competitive economic advantage over others, including data protected under copyright, patent, or trade secret laws and the sensitive, protected data that companies collect on behalf of their customers;
Infrastructure Data - data from the physical facilities and systems – such as roads, electrical
systems, communications services, etc. – that enable local, regional, national, and
international economic activity; and
Controlled Technical Data - technical, biological, chemical, and military-related data and
research that could be considered of national interest and be under foreign export restrictions.
It may be possible to work with publicly released annualized and cleansed data within these
areas without a problem, but the majority of granular data from which significant insight can
be gleaned is protected. In most instances, scientists, researchers, and other authorized
developers take years to appropriately acquire the expertise, build the personal relationships,
and construct the technical, procedural and legal infrastructure to work with the granular data
before implementing any approach. Even using publicly released datasets within these areas
can be restricted, requiring either registration, the recognition of or affiliation to an
appropriate data governing body, background checks, or all three before authorization is
granted.
The second group of data that raises privacy and security concerns is personal data.
Commonly referred to as Personally Identifiable Information (PII), it is any data that
distinguishes individuals from each other. It is also the data that an increasing number of
digital approaches rely on, and the data whose use tends to raise the most public ire. Personal
data could include but is not limited to an individual‘s:
• Government issued record data (social security numbers, national or state identity
numbers, passport records, vehicle data, voting records, etc.);
• Law enforcement data (criminal records, legal proceedings, etc.);
• Personal financial, employment, medical, and education data;
• Communication records (phone numbers, texts data, message records, content of
conversations, time and location, etc.);
• Travel data (when and where traveling, carriers used, etc.);
• Networks and memberships (family, friends, interests, group affiliations, etc.);
• Location data (where a person is and when);
• Basic contact information (name, address, e-mail, telephone, fax, twitter handles, etc.);
• Internet data (search histories, website visits, click rates, likes, site forwards,
comments, etc.);
• Media data (what shows you're watching, music you're listening to, books or magazines you're reading, etc.);
• Transaction data (what you‘re buying or selling, who you‘re doing business with,
where, etc.); and
• Bio and activity data (from personal mobile and wearable devices).
In industries where being responsible for handling highly detailed personal data is the
established business norm – such as in the education, medical and financial fields – there are
already government regulations, business practices and data privacy and security laws that
protect data from unauthorized usage, including across new digital platforms. But in many
other industries, particularly in data driven industries where personal data has been treated as
proprietary data and become the foundation of business models, there is currently little to no
regulation. In the new normal, the more that a data approach depends on data actively or
passively collected on individuals, the more likely that consumers will speak up and demand
privacy protection, even if they previously gave some form of tacit approval to use their data.
Despite this new landscape, there are lots of different ways to use personal data, some of which may not trigger significant privacy or security concerns. This is particularly true in cases where individuals willingly provide their data or the data cannot be attributed to an individual. Whether individuals remain neutral to data approaches tends to be related to the level of control they feel they have over how their personal data is used. Some organizations that collect personal data extensively, such as Facebook and Google, work to increasingly provide their users with methods to control their own data. But for others, the lack of due diligence on data privacy in their approaches has already had its effect.
A third category of data needing privacy consideration is the data related to good
people working in difficult or dangerous places. Activists, journalists, politicians,
whistleblowers, business owners, and others working in contentious areas and conflict zones
need secure means to communicate and share data without fear of retribution and personal
harm.
Users lose trust because they feel abused by malicious ads; they feel abused by fake and misleading content, and they feel abused by "act first, and apologize profusely later" cultures at many of the major online companies. And users ought to feel abused by many abuses they don't even know about. Why was their insurance claim denied? Why weren't they approved for that loan? Were those decisions made by a system that was trained on biased data? The slogan goes, "Move fast and break things." But what if what gets broken is society?
Data collection is a big business. Data is valuable: "the new oil," as the Economist proclaimed. We've known that for some time. But the public provides the data under the assumption that we, the public, benefit from it. We also assume that data is collected and stored responsibly, and those who supply the data won't be harmed. Essentially, it's a model of trust. But how do you restore trust once it's been broken? It's no use pretending that you're trustworthy when your actions have proven that you aren't. The only way to get trust back is to be trustworthy, and regaining that trust once you've lost it takes time.
There's no simple way to regain users' trust, but a "golden rule" for data is a starting point: treat others' data as you would like others to treat your own data. However, implementing a golden rule in the actual research and development process is challenging. The golden rule isn't enough by itself. There have to be guidelines that force discussions with the application development teams, application users, and those who might be harmed by the collection and use of data. Five framing guidelines help us think about building data products. We call them the five Cs: consent, clarity, consistency, control (and transparency), and consequences (and harm).
4.6.1 Consent
The trust between the people who are providing data and the people who are using it cannot be established without agreement about what data is being collected and how that data will be used. Agreement starts with obtaining consent to collect and use data. Unfortunately, the agreements between a service's users (people whose data is collected) and the service itself (which uses the data in many ways) are binary (meaning that you either accept or decline) and lack clarity. In business, when contracts are being negotiated between two parties, there are multiple iterations (redlines) before the contract is settled. But when a user is agreeing to a contract with a data service, you either accept the terms or you don't get access. It is non-negotiable.
Data is frequently collected, used, and sold without consent. This includes organizations like Acxiom, Equifax, Experian, and TransUnion, who collect data to assess financial risk, but many common brands also collect data without consent. In Europe, Google collected data from cameras mounted on cars to develop new mapping products. AT&T and Comcast both used cable set-top boxes to collect data about their users, and Samsung collected voice recordings from TVs that respond to voice commands.
4.6.2 Clarity
Clarity is closely related to consent. Users must have clarity about what data they are
providing, what is going to be done with the data, and any downstream consequences of how
their data is used. All too often, explanations of what data is collected or being sold are
buried in lengthy legal documents that are rarely read carefully, if at all. Observant readers of
Eventbrite‘s user agreement recently discovered that listing an event gave the company the
right to send a video team, and exclusive copyright to the recordings. And the only way to opt
out was by writing to the company. The backlash was swift once people realized the potential
impact, and Eventbrite removed the language.
Facebook users who played Cambridge Analytica's "This Is Your Digital Life" game may have understood that they were giving up their data; after all, they were answering questions,
and those answers certainly went somewhere. But did they understand how that data might be
used? Or that they were giving access to their friends‘ data behind the scenes? That‘s buried
deep in Facebook‘s privacy settings. It really doesn‘t matter which service you use; you
rarely get a simple explanation of what the service is doing with your data, and what
consequences their actions might have. Unfortunately, the process of consent is often used to
obfuscate the details and implications of what users may be agreeing to. And once data has
escaped, there is no recourse.
4.6.4 Control and Transparency
All too often, users have no effective control over how their data is used. They are given all-
or-nothing choices, or a convoluted set of options that make controlling access overwhelming
and confusing. It‘s often impossible to reduce the amount of data collected, or to have data
deleted later. A major part of the shift in data privacy rights is moving to give users greater
control of their data. For example, Europe‘s General Data Protection Regulation (GDPR)
requires a user‘s data to be provided to them at their request and removed from the system if
they so desire.
4.6.5 Consequences
Data products are designed to add value for a particular user or system. As these products
increase in sophistication, and have broader societal implications, it is essential to ask
whether the data that is being collected could cause harm to an individual or a group. The
unforeseen consequences and the "unknown unknowns" about using data and combining data sets have been witnessed frequently. Risks can never be eliminated completely. However, many unforeseen consequences and unknown unknowns could be foreseen and known, if only people had tried. All too often, unknown unknowns are unknown because we don't want to know.
While Strava and AOL triggered a chain of unforeseen consequences by releasing their data,
it‘s important to understand that their data had the potential to be dangerous even if it wasn‘t
released publicly. Collecting data that may seem innocuous and combining it with other data
sets has real-world implications. It‘s easy to argue that Strava shouldn‘t have produced this
product, or that AOL shouldn‘t have released their search data, but that ignores the data‘s
potential for good. In both cases, well-intentioned data scientists were looking to help others.
The problem is that they didn‘t think through the consequences and the potential risks.
Many data sets that could provide tremendous benefits remain locked up on servers. Medical
data that is fragmented across multiple institutions limits the pace of research. And the data
held on traffic from ride-sharing and gps/mapping companies could transform approaches for
traffic safety and congestion. But opening up that data to researchers requires careful
planning.
Reports of AI gone wrong abound, and Responsible AI has started to take a foothold in
business — Gartner has even added Responsible AI as a new category on its Hype Cycle for
Emerging Technologies. Yet when talking about solutions, increasing diversity and making
data science a more inclusive field unfortunately don‘t often top the list. Noelle Silver, Head
of Instruction, Data Science, Analytics, and Full Stack Web Development at HackerU, is
looking to change that.
A 2018 study revealed that only 15% of data scientists are women, and sadly, a 2020 study
found exactly the same results: it seems we haven‘t managed to move the needle. While
diversity obviously encompasses more than just women, few studies have been able to
quantify other types of representation in the field. Inclusivity is similarly difficult to quantify,
both in terms of people working on technology and the ways in which technology can be
accessed by all. But that doesn't mean there aren't solutions.
Problem: "The reality is that when we train machine learning models with a bunch of data, it's going to make predictions based on that data. If that data comes from a room of people that look the same, talk the same, act the same, they're all friends — it's not a bad scenario. In the moment, you feel like things are good. No one is really seeing any problems; you don't feel any friction. It's very misleading, especially in artificial intelligence. So, you go to market.
The problem, though, is not everyone looks like you, or talks like you, or thinks like you. So even though you found a community of people that built this software that thinks the same, as soon as you go to market and someone other than that starts using it, they start to feel that friction."
Solution
Of course, there's no easy, magic-bullet solution to this problem, but the foundations of a good start are:
• Committing the time and resources to practice inclusive engineering: this includes, but certainly isn't limited to, doing whatever it takes to collect and use diverse datasets.
• Creating an experience that welcomes more people to the field: this might mean looking at everything from education to hiring practices.
• Thinking beyond regulations: simply being compliant doesn't necessarily mean experiences are optimized.
Trend 3: Decision intelligence
Decision intelligence brings together a number of disciplines, including decision management
and decision support. It encompasses applications in the field of complex adaptive systems
that bring together multiple traditional and advanced disciplines. It provides a framework to
help data and analytics leaders design, compose, model, align, execute, monitor and tune
decision models and processes in the context of business outcomes and behavior. Explore
using decision management and modeling technology when decisions need multiple logical
and mathematical techniques, must be automated or semi-automated, or must be documented
and audited.
Trend 4: X analytics
Gartner coined the term "X analytics" as an umbrella term, where X is the data variable for a range of different structured and unstructured content such as text analytics, video analytics, audio analytics, etc.
Data and analytics leaders use X analytics to solve society‘s toughest challenges, including
climate change, disease prevention and wildlife protection with analytics capabilities
available from their existing vendors, such as cloud vendors for image, video and voice
analytics, but recognize that innovation will likely come from small disruptive startups and
cloud providers.
Trend 10: Relationships form the foundation of data and analytics value
Graph analytics is a set of analytic techniques that allows for the exploration of relationships
between entities of interest such as organizations, people and transactions. It helps data and
analytics leaders find unknown relationships in data and review data not easily analyzed with
traditional analytics.
4.9 SUMMARY
When combined with ML algorithms, analytics capabilities such as these (for example, X analytics and graph analytics) can be used to comb through thousands of data sources and documents, helping medical and public health experts rapidly discover possible new treatments or factors that contribute to more negative outcomes for some patients.
Data and analytics leaders need to evaluate opportunities to incorporate graph analytics into
their analytics portfolios and applications to uncover hidden patterns and relationships. In
addition, consider investigating how graph algorithms and technologies can improve your AI
and ML initiatives.
REFERENCES
1. https://www.knowledgehut.com/blog/big-data/5-best-data-processing-frameworks
2. Mark J. Costello, Peter T. Harris, Bryony Pearce, Andrea Fiorentino, Jean-François Bourillet, Sarah M. Hamylton, "A Glossary of Terminology Used in Marine Biology, Ecology, and Geology", in Michael I. Goldstein, Dominick A. DellaSala (eds.), Encyclopedia of the World's Biomes, Elsevier, 2020, Pages 471-478, ISBN 9780128160978, https://doi.org/10.1016/B978-0-12-409548-9.11944-X.
3. Big Data Analytics, Advances in Intelligent Systems and Computing. Springer
Singapore (Print ISBN: 978-981-10-6619-1 Electronic ISBN: 978-981-10-6620-7)
4. https://www.solvexia.com/blog/15-big-data-problems-you-need-to-solve
5. https://blog.quantela.com/the-ethical-challenges-of-a-data-science-practitioner/
6. https://towardsdatascience.com/5-key-ai-problems-related-to-data-privacy-f39558290530
7. https://www.oreilly.com/radar/the-five-cs
8. https://builtin.com/data-science
9. https://www.edx.org/course/subject/data-science
10. https://www.geeksforgeeks.org/types-of-sources-of-data-in-data-mining/
11. https://internetofthingsagenda.techtarget.com/definition/sensor-data
12. https://3bplus.nl/how-do-smart-devices-work-sensors-iot-big-data-and-ai/
13. Introducing Data science, big data, machine learning, and more, using python tools,
Cielen D., Meysman A., Ali M., Manning Publications Co ., 2016, ISBN:
9781633430037
14. https://www.andrew.cmu.edu/user/nbier/15110/lectures/lec15a_sound_video.pdf
15. https://simplicable.com/new/data-artifact
16. https://www.geeksforgeeks.org/different-sources-of-data-for-data-analysis/
17. https://nptel.ac.in/courses/106/105/106105174/
B.Sc.(DATA SCIENCE)
SEMESTER-I
5.0 Objectives
5.1 Introduction
5.2 Data wrangling
5.2.1 Steps for data wrangling
5.2.2 Tools for data wrangling
5.3 Combining dataset
5.4 Concatenating dataset
5.5 Merging dataset
5.6 Reshaping dataset
5.6.1 Using the melt() function
5.6.2 Using the stack() and unstack() functions
5.6.3 Using the pivot() function
5.7 Data transformation
5.7.1 Data frame creation
5.7.2 Missing value
5.7.3 Encoding
5.7.4 Insert new column
5.7.5 Split column
5.8 String manipulation
5.9 Regular expression
5.9.1 The findall() function
5.9.2 The search() function
5.9.3 The split() function
5.9.4 The sub() function
5.10 Summary
5.11 References
5.0 OBJECTIVES
The main goal of this module is to help students learn, understand and practice the data
science approaches, which include the study of latest data science tools with latest
programming languages. The main objectives of this module are data wrangling which
includes data discovery, structuring, cleaning, enriching, validating and publishing, combining
and merging datasets, data reshaping, pivoting, transformation, string manipulation operations
and regular expression.
5.1 INTRODUCTION
Data science has become a buzzword that everyone talks about. Data science is an interdisciplinary field that combines domain expertise, computer programming skills, and mathematical and statistical knowledge to find or extract meaningful or unknown patterns from unstructured and structured datasets.
Data science is useful for the extraction, preparation, analysis and visualization of information. Various scientific methods can be applied to gain insight into data.
Data science is all about using data to solve problems. Data has become the fuel of industries. It is one of the most in-demand fields of the 21st century. Every industry requires data for functioning, searching, marketing, growing and expanding its business.
The application areas of data science include health care, fraud detection, disease prediction, real-time shipping routes, speech recognition, targeted advertising, gaming and many more.
5.2 DATA WRANGLING
Data wrangling is a key component of any data science project. Data wrangling is a process where one transforms "raw" data to make it more suitable for analysis and to improve its quality. Data wrangling is the process of collecting, gathering and transforming raw data into an appropriate format for accessing, analyzing, easy understanding and further processing for better and quicker decision making. It is also known as Data Munging or Data Pre-Processing.
Data wrangling is a crucial first step in the preparation of data for the broad analysis of huge amounts of data or big data. It requires a significant amount of time and effort. If data wrangling is properly conducted, it gives you insights into the nature of the data. It is not a one-time process but an iterative or repetitive one. Each step in the data wrangling process exposes new potential ways of re-wrangling the data towards the goal of complex data manipulation and analysis.
5.2.1 Steps for Data Wrangling
Data wrangling process includes six core activities:
Discovery
Structuring
Cleaning
Enriching
Validating
Publishing
Discovery:
Data discovery is an umbrella term. It describes the process of understanding the dataset and gaining insight into it. It involves the collection and evaluation of data from various sources and is often used to spot trends, detect patterns in the data and gain instant insight.
Nowadays large business organizations or companies have a large amount of data related to customers, suppliers, production, sales, marketing etc.
For example, if a company has a customer database, you can identify which part of the city, state or country most of the customers come from.
Structuring:
Data comes from various sources in different formats. Structuring is necessary because of the different sizes and shapes of the data. Data structuring is the process or set of actions that change the form or schema of the dataset. Splitting data into columns, deleting some fields from the dataset and pivoting rows are forms of data structuring.
Cleaning:
Before starting data manipulation or data analysis, you need to perform data cleaning. Data cleaning is the process of identifying data quality issues, such as missing or mismatched values, duplicate records etc., and applying the appropriate transformations to correct, replace or delete those values or records so that the dataset is of high quality.
Enriching:
The data is useful for the decision-making process in the business. The data needed to take business-related decisions may be stored in multiple files. To gather all the necessary insights into a single file, you need to enrich your existing dataset by joining and aggregating multiple data sources.
Validating:
After completion of data cleaning and data enriching, you need to check the accuracy of the data. If the dataset is not accurate, it might create problems, so data validation is necessary. This is the final check that no missing or mismatched data was left uncorrected during the transformation process. It is also necessary to validate that the output dataset has the intended structure and content before publishing it.
Publishing:
After successful completion of data discovery, structuring, cleaning, enriching and validation, it is time to publish the wrangled output data for further analytics processes, if any. The published data can be uploaded into the organization's software or stored in a file at a specific location where the organization's people know it is ready to use.
R
R is also a popular, open-source and powerful tool for data science and data management. It supports many libraries such as dplyr, tidyr, ggplot2 etc. for data manipulation and data visualization.
Tabula
Tabula is a tool for liberating data tables trapped inside PDF files. It allows the user to upload files in PDF format and extract selected rows and columns from any tables available in the PDF file. It supports exporting this data from PDF to CSV or Microsoft Excel file format.
DataWrangler
It is an interactive tool for data cleaning. It takes real-world raw data and transforms it into data tables which can be used for further processing or analysis. It also supports exporting the data tables to Microsoft Excel, R, Tabula etc.
OpenRefine
OpenRefine, previously known as Google Refine, is a Java-based, open-source, powerful tool for manipulating huge datasets. It is used for loading, understanding, cleaning and transforming data from one format to another. It also supports extending the data with web services.
CSVKit
CSVKit is a suite of command-line tools for converting and working with CSV files. It supports converting data from Excel to CSV, from JSON to CSV, querying with SQL etc.
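Datasets can be concatenated with the pandas concat() function. The original syntax listing is not reproduced in this material; the function is broadly of the following form (shown for illustration, with the arguments described below; join_axes exists only in older pandas versions):
pd.concat(objs, axis=0, join='outer', join_axes=None, ignore_index=False, keys=None)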
Here,
objs: This is a sequence or mapping of Series or DataFrame objects.
axis: This is the axis to concatenate along. The value is {0, 1, ...}, default is 0.
join: This specifies how to handle indexes on the other axis(es). The value is {'inner', 'outer'}, default is 'outer'. The outer option performs a union operation and the inner option an intersection.
join_axes: This is a list of index objects: specific indexes to use for the other (n-1) axes instead of performing inner/outer set logic.
ignore_index: This value is Boolean type, default is False. If this value is True, do not
use the index values on the concatenation axis. The resulting axis will be labeled
0, ..., n-1.
keys: This is a sequence to add an identifier to the result indexes, default is None.
Example: Here we perform the concatenation operation using two different data frames
i.e. df1 and df2.
The below code creates the two different data frame df1 and df2.
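The original code listing is not shown here; an illustrative sketch (the column names and values are assumptions) that creates two five-record data frames and concatenates them is:
import pandas as pd

df1 = pd.DataFrame({
    "Rno": [1, 2, 3, 4, 5],
    "Name": ["Rajan", "Shital", "Mayur", "Mittal", "Mahesh"]})
df2 = pd.DataFrame({
    "Rno": [6, 7, 8, 9, 10],
    "Name": ["Jayesh", "Priya", "Vijay", "Anita", "Kiran"]})

# Concatenate the two data frames one below the other (ten records)
df3 = pd.concat([df1, df2])
df3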
The above code will give the following output, which concatenates the five records of data frame df1 and the five records of data frame df2 into a single data frame df3 with ten records.
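The corresponding listing for horizontal concatenation is not reproduced; a sketch could be:
# Concatenate side by side: the columns of df2 are placed next to those of df1
df3 = pd.concat([df1, df2], axis=1)
df3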
The above code will give the following output, which concatenates data frame df1 and data frame df2 into a single data frame df3 horizontally, using axis=1 as an argument.
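The listing for concatenation with the keys argument is likewise not shown; a sketch (the labels x and y follow the description below) could be:
# The keys argument adds an outer index level identifying the source data frame
df3 = pd.concat([df1, df2], keys=["x", "y"])
df3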
The above code will give the following output, which concatenates data frame df1 and data frame df2 into a single data frame df3, using x and y as the keys argument.
Now we perform the concatenation with keys and ignore_index as arguments. The result follows its own fresh indexing if we set ignore_index to True.
The below code will perform the concatenation operation on both data frame df1 and df2
with keys and ignore_index as an argument.
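The original listing is not shown; an illustrative sketch could be:
# With ignore_index=True the keys are ignored and a fresh 0..n-1 index is used
df3 = pd.concat([df1, df2], keys=["x", "y"], ignore_index=True)
df3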
The above code will give the following output, which concatenates data frame df2 with data frame df1.
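Datasets are merged with the pandas merge() function. Its syntax line is likewise missing from this material; broadly, it takes the arguments described below (shown for illustration):
pd.merge(left, right, how, on, left_on, right_on, left_index, right_index, sort)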
Here,
left: This is the first data frame.
right: This is the second data frame.
how: This specifies how to perform the merge operation. The value of this field is one of 'left', 'right', 'inner', 'outer'; default is 'inner'.
on: This is the name of the column on which the merge is to be performed. This column must be available in both the left and the right data frame objects.
left_on: This is the name of the column from the left data frame to use as a key.
right_on: This is the name of the column from the right data frame to use as a key.
left_index: If True, use the index (row label) from the left data frame as its join key.
right_index: If True, use the index (row label) from the right data frame as its join key.
sort: This is used to sort the resulting data frame by the join keys. The value of this field is Boolean.
Example: Here we perform the merge operation using two different data frames i.e. left
and right.
The below code creates two different data frames left and right.
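The original listing is not reproduced; an illustrative sketch (the column names and values are assumptions chosen to match the examples that follow) could be:
left = pd.DataFrame({
    "Rno": [1, 2, 3, 4, 5],
    "Name": ["Rajan", "Shital", "Mayur", "Mittal", "Mahesh"],
    "course": ["Python", "Java", "C", "PHP", "SQL"]})
right = pd.DataFrame({
    "Rno": [1, 2, 3, 4, 5],
    "Marks": [85, 78, 92, 66, 71],
    "course": ["Python", "Java", "R", "PHP", "HTML"]})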
Now we perform the merge operation on both data frame using on as an argument.
The below code will perform the merge operation on both data frame left and right using
single on key as an argument.
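A sketch of merging on a single key, assuming the left and right frames above:
# Merge on the common "Rno" column (default how='inner')
pd.merge(left, right, on="Rno")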
The above code will give the following output, which merges both data frames left and right using a single on key as an argument.
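The listing for merging on multiple keys is not shown; a sketch could be:
# Merge only where both "Rno" and "course" match in the two frames
pd.merge(left, right, on=["Rno", "course"])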
The above code will give the following output, which merges both data frames left and right using multiple on keys as an argument.
Now we perform the merge operation using how as an argument. This argument specifies how to determine which keys are to be included in the resulting data frame or table. If a key combination does not appear in one of the data frames or tables, NaN will be displayed for the missing values in the joined table.
The merge methods are same as SQL join equivalent as below:
Merge Method SQL Join Equivalent Description
left Left Outer Join Use keys from left object
right Right Outer Join Use keys from right object
inner Inner Join Use intersection of keys
outer Full Outer Join Use unions of keys
This can be represented graphically as follows:
The below code will perform the merge operation on both data frame left and right on course
using how=„left‟ method.
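The listing is not reproduced; a sketch (using the left and right frames defined above and merging on the course column):
pd.merge(left, right, on="course", how="left")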
The above code will give the following output, which merge both data frame left and right
using left join. It will display left data frame records plus common records as follows:
The below code will perform the merge operation on both data frame left and right on course
using how=„right‟ method.
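A corresponding sketch for the right join:
pd.merge(left, right, on="course", how="right")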
The above code will give the following output, which merge both data frame left and right
using right join. It will display right data frame records plus common records as follows:
The below code will perform the merge operation on both data frame left and right on course
using how=„inner‟ method.
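A corresponding sketch for the inner join:
pd.merge(left, right, on="course", how="inner")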
The above code will give the following output, which merge both data frame left and right
using inner join. It performs the intersection operation on both data frame, which will display
only common records as follows:
The below code will perform the merge operation on both data frames left and right on course using the how='outer' method.
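A corresponding sketch for the outer join:
pd.merge(left, right, on="course", how="outer")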
The above code will give the following output, which merge both data frame left and right
using outer join. It performs the union operation on both data frame, which will display left
data frame records, right data frame records and common records as follows:
The data frame can be reshaped by using the melt(), stack(), unstack() and pivot() functions as follows.
The below code creates and displays the data frame df.
import pandas as pd

# Dataframe creation
df = pd.DataFrame({
    "Rno": [1, 2, 3, 4, 5],
    "Name": ["Rajan", "Shital", "Mayur", "Mittal", "Mahesh"],
    "Age": [25, 27, 24, 25, 21]})
# Display dataframe
df
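The melt() listing itself is not reproduced in this material; the call that would produce output of the shape shown below is:
# Melt the whole frame: every column becomes (variable, value) pairs
pd.melt(df)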
variable   value
0 Rno 1
1 Rno 2
2 Rno 3
3 Rno 4
4 Rno 5
5 Name Rajan
6 Name Shital
7 Name Mayur
8 Name Mittal
9 Name Mahesh
10 Age 25
11 Age 27
12 Age 24
13 Age 25
14 Age 21
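The stack() listing is likewise not shown; the call that would produce the output below is:
# Stack pivots the columns into an inner index level, giving a Series
df.stack()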
0 Rno 1
Name Rajan
Age 25
1 Rno 2
Name Shital
Age 27
2 Rno 3
Name Mayur
Age 24
3 Rno 4
Name Mittal
Age 25
4 Rno 5
Name Mahesh
Age 21
dtype: object
The unstack() function is used to revert the changes made to a data frame by the stack() function.
The below code will perform the reshape operation using unstack function.
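The listing is not shown; applied to the original data frame df, the call would be:
# Unstack pivots the (integer) row index into an inner index level
df.unstack()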
Rno   0         1
      1         2
      2         3
      3         4
      4         5
Name  0     Rajan
      1    Shital
      2     Mayur
      3    Mittal
      4    Mahesh
Age   0        25
      1        27
      2        24
      3        25
      4        21
dtype: object
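The pivot() listing is also missing from this material; a call that would produce the wide output shown below (pivoting on the "Rno" column) is:
# Each distinct Rno value becomes a sub-column under Name and Age
df.pivot(columns="Rno")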
        Name                                        Age
Rno        1       2      3       4       5          1     2     3     4     5
0      Rajan     NaN    NaN     NaN     NaN       25.0   NaN   NaN   NaN   NaN
1        NaN  Shital    NaN     NaN     NaN        NaN  27.0   NaN   NaN   NaN
2        NaN     NaN  Mayur     NaN     NaN        NaN   NaN  24.0   NaN   NaN
3        NaN     NaN    NaN  Mittal     NaN        NaN   NaN   NaN  25.0   NaN
4        NaN     NaN    NaN     NaN  Mahesh        NaN   NaN   NaN   NaN  21.0
# Dataframe creation
df = pd.DataFrame({
    "Name": ["Jayesh Patel", "Priya Shah", "Vijay Sharma"],
    "Gender": ["Male", "Female", "Male"],
    "City": ["Rajkot", "Delhi", "Mumbai"]})
# Display dataframe
df
The above code will create and display data frame as follows:
df.isna().sum()
Here there are no missing values in any column, so it returns zero for each column. If there are missing values in a column, it returns the number of missing values.
Name      0
Gender    0
City      0
dtype: int64
The below code will drop all the rows which containing missing value.
df = df.dropna()
The below code will drop the columns where all elements are missing values.
df.dropna(axis=1, how='all')
The below code will drop the columns where any of the elements containing missing values.
df.dropna(axis=1, how='any')
The below code will keep only the rows which contain at least two non-missing values.
df.dropna(thresh=2)
The below code will fill all missing values with mean value of the particular column.
df.fillna(df.mean())
The below code will fill any missing values in a specified column with the median value of that column. Here we take the "Age" column as an example.
df['Age'].fillna(df['Age'].median())
The below code will fill any missing values in a specified column with the mode value of that column. Here we take the "Age" column as an example.
df['Age'].fillna(df['Age'].mode()[0])
5.7.3 Encoding
The dataset contains both numerical and categorical values. Categorical data is not directly usable for numerical processing or analytics, so the categorical values need to be encoded into numeric values.
Data encoding is the process of converting a categorical variable into a numerical form.
Here we discuss label encoding, which simply converts each value in a column to a number. Our dataset has a "Gender" column which has only two values, "Male" and "Female".
It encodes like this:
Male 0
Female 1
df=df.Gender.replace({"Male":0,"Female":1})
df
0 0
1 1
2 0
Name: Gender, dtype: int64
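The listing for inserting a column is not reproduced; a sketch, assuming df still holds the Name/Gender/City data frame created above, could be:
# Insert "Age" as the second column (position 1)
df.insert(1, "Age", [21, 23, 24])
df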
The above code will insert a new column "Age" with the values 21, 23 and 24 respectively, as the second column in the data frame. The data frame with the newly added column will be displayed as follows:
Name Age Gender City
0 Jayesh Patel 21 Male Rajkot
1 Priya Shah 23 Female Delhi
2 Vijay Sharma 24 Male Mumbai
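The splitting code is likewise not shown; a minimal sketch, assuming each Name value holds a first and last name separated by a space, is:

# Split "Name" into two new columns on the first space
df[["First Name", "Last Name"]] = df["Name"].str.split(" ", n=1, expand=True)
df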
The above code will create two different columns (i.e. First Name and Last Name) from the single "Name" column of the dataset. The resulting data frame contains the split columns in addition to the original columns.
Function Description
findall It returns a list of all matched values.
search It returns a match object if a match is found anywhere in the string.
split It returns a list, splitting the string at each match.
sub It replaces one or more matches with a specified string.
Example:
import re
string = "Working with Data Science using Python"
result = re.findall("th", string)
result
The list contains all matched values in the order they are found. Output:
['th', 'th']
Example:
import re
string = "Working with Data Science using Python"
result = re.findall("ds", string)
result
Since the pattern "ds" does not occur in the string (matching is case-sensitive), an empty list is returned. Output:
[]
Example:
import re
string = "Working with Data Science using Python"
result = re.search("wi", string)
result
Output:
<re.Match object; span=(8, 10), match='wi'>
Example:
import re
string = "Working with Data Science using Python"
result = re.search(r"\s", string)
result
It returns a match object for the first whitespace character. Output:
<re.Match object; span=(7, 8), match=' '>
Example:
import re
string = "Working with Data Science using Python"
result = re.split(r"\s", string)
result
It splits the string at every match of "\s", i.e. at every space. Output:
['Working', 'with', 'Data', 'Science', 'using', 'Python']
Example:
import re
string = "Working with Data Science using Python"
result = re.split(r"\s", string, 2)
result
Here the string is split only at the first two occurrences of "\s" (space); the remaining string is kept as it is. Output:
['Working', 'with', 'Data Science using Python']
Example:
import re
string = "Working with Data Science using Python"
result = re.sub(r"\s", "-", string)
result
Every match of "\s" (space) is replaced with "-" (dash). Output:
'Working-with-Data-Science-using-Python'
Example:
import re
string = "Working with Data Science using Python"
result = re.sub(r"\s", "-", string, 2)
result
Here only the first two occurrences of "\s" (space) are replaced with "-" (dash); the remaining string is kept as it is. Output:
'Working-with-Data Science using Python'
5.10 SUMMARY
In this module the students learn about data pre-processing and will be able to perform various data science related operations using Python:
Ability to do data wrangling, which includes data discovery, structuring, cleaning, enriching, validating and publishing.
Ability to combine different datasets using the concat, merge and join functions with different arguments.
Ability to reshape a dataset using the melt, stack, unstack and pivot functions.
Ability to work with missing values, encode categorical data into numerical values, split a dataset, etc.
Ability to perform various functions related to strings and regular expressions.
REFERENCES
Books
1. Davy Cielen, Arno D. B. Meysman, Mohamed Ali : Introducing Data Science, Manning
Publications Co.
2. Stephen Klosterman (2019) : Data Science Projects with Python, Packt Publishing
3. Jake VanderPlas (2017) : Python Data Science Handbook: Essential Tools for
Working with Data, O'Reilly
4. Wes McKinney and Pandas Development Team (2021) : pandas : powerful Python data
analysis toolkit, Release 1.2.3
Web References
1. https://www.geeksforgeeks.org
2. https://www.tutorialspoint.com
3. https://www.w3schools.com
4. https://pandas.pydata.org
5. https://pbpython.com
QUESTIONS
Short Answer:
Long Answer:
PRACTICALS
B.Sc.(DATA SCIENCE)
SEMESTER-I
STRUCTURE
6.0 Objectives
6.1 Introduction
6.2 Data aggregation
6.3 Group wise operation
6.4 Transformation
6.5 Pivot table
6.6 Cross tabulations
6.7 Date and time data type
6.8 Summary
6.9 References
6.0 OBJECTIVES
The main goal of this module is to help students learn, understand and practice data science approaches, including the study of the latest data science tools and programming languages. The main topics of this module are data aggregation; group wise operations including splitting, applying and combining data; data transformation using lambda functions; pivot tables; cross tabulation using two-way and three-way cross tables; and date and time data types.
6.1 INTRODUCTION
Data science has become a buzzword that everyone talks about. Data science is an interdisciplinary field that combines domain expertise, computer programming skills, and mathematical and statistical knowledge to extract meaningful or unknown patterns from unstructured and structured datasets.
Data science is useful for the extraction, preparation, analysis and visualization of information. Various scientific methods can be applied to gain insight into data.
Data science is all about using data to solve problems. Data has become the fuel of industries. It is among the most in-demand fields of the 21st century. Every industry requires data for functioning, searching, marketing, growing and expanding its business.
The application areas of data science include health care, fraud detection, disease prediction, real-time shipping routes, speech recognition, targeted advertising, gaming and many more.
6.2 DATA AGGREGATION
Data aggregation is the process of gathering raw data and expressing it in a summarised form for statistical analysis. Data may be collected from various sources and combined into a summary format for data analysis.
A dataset contains a large amount of data in rows and columns; there may be thousands or more data records in a single dataset. Data aggregation is useful for accessing and processing such large amounts of data quickly. Aggregated data can be accessed quickly to gain insight instead of accessing all the data records. A single row of aggregated data can represent a large number of data records over a given time period, using statistics such as sum, minimum, maximum, average and count.
Sum : This function adds all the specified data to get a total.
Min : This function displays the lowest value of each specified category.
Max : This function displays the highest value of each specified category.
Average : This function calculates the average value of the specified data.
Count : This function counts the total number of data entries for each category.
Data aggregation provides more insight based on related clusters of data. For example, if a company wants to know the sales performance of different districts, it would aggregate the sales data by district. Data can also be aggregated by date if you want to know the trends over a period of months, quarters, years, etc.
Example: Here, we will perform the aggregation operation on data frame.
The below code will create and display data frame df.
# Dataframe creation
df = pd.DataFrame({
    "Rno": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "Maths": [67, 83, 74, 91, 55, 70, 86, 81, 92, 67],
    "Physics": [56, 67, 72, 84, 89, 79, 90, 89, 92, 82],
    "Chemistry": [81, 88, 78, 69, 74, 72, 83, 90, 58, 68],
    "Biology": [90, 83, 86, 75, 68, 79, 67, 71, 91, 89],
    "English": [60, 55, 63, 71, 88, 75, 91, 82, 85, 80]})

# Display dataframe
df
   Rno  Maths  Physics  Chemistry  Biology  English
0    1     67       56         81       90       60
1    2     83       67         88       83       55
2    3     74       72         78       86       63
3    4     91       84         69       75       71
4    5     55       89         74       68       88
5    6     70       79         72       79       75
6    7     86       90         83       67       91
7    8     81       89         90       71       82
8    9     92       92         58       91       85
9   10     67       82         68       89       80
df.aggregate(["min","max"])
The above code will give the following output.
     Rno  Maths  Physics  Chemistry  Biology  English
min    1     55       56         58       67       55
max   10     92       92         90       91       91
The below code will find the minimum and maximum values and calculate the average and sum of the different subjects in data frame df.
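The corresponding call is not shown in the source; a minimal sketch that applies all four aggregations at once, assuming df as created above, is:

df.aggregate(["min", "max", "mean", "sum"])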
The below code will find the min and max of the "Maths" subject, the max and sum of the "Physics" subject, and the min, median and std of the "English" subject of data frame df.
df.agg({"Maths":["min","max"],
"Physics":["max","sum"],
"English":["min","median","std"]})
Rno 55
Maths 766
Physics 800
Chemistry 761
Biology 799
English 750
dtype: int64
df.min()
Rno 1
Maths 55
Physics 56
Chemistry 58
Biology 67
English 55
dtype: int64
df.max()
Rno 10
Maths 92
Physics 92
Chemistry 90
Biology 91
English 91
dtype: int64
df.mean()
Rno 5.5
Maths 76.6
Physics 80.0
Chemistry 76.1
Biology 79.9
English 75.0
dtype: float64
df.count()
Rno 10
Maths 10
Physics 10
Chemistry 10
Biology 10
English 10
dtype: int64
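The standard deviations shown next presumably come from df.std(); a hedged sketch:

df.std()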
Rno 3.027650
Maths 11.992590
Physics 11.718931
Chemistry 9.859570
Biology 9.230986
English 12.400717
dtype: float64
df.describe()
6.3 GROUP WISE OPERATION
The groupby() function splits the data into groups based on some criteria. Its main parameters are:
by: mapping, function, label, str
axis: int, 0 or index and 1 or columns, default is 0, split along rows (0) or columns (1).
level: int, level name, default is None, If the axis is a MultiIndex, group by a particular
level or levels.
as_index: boolean, default is True, for aggregated output, return object with group
labels as the index.
sort: boolean, default is True, sort group keys.
group_keys: boolean, default is True, when calling apply, add group keys to index to
identify pieces.
squeeze: boolean, default is False, reduce the dimensionality of the return type, if
possible
Example: Here, we will perform the groupby operation on data frame.
The below code will create and display data frame df.
# Dataframe creation
df = pd.DataFrame({
    "Product": ["Mango", "Corn", "Orange", "Cabbage", "Mango", "Corn",
                "Watermelon", "Apple", "Pumkin", "Mango"],
    "Category": ["Fruit", "Vegetable", "Fruit", "Vegetable", "Fruit",
                 "Vegetable", "Fruit", "Fruit", "Vegetable", "Fruit"],
    "Qty": [12, 5, 10, 2, 10, 3, 5, 8, 2, 10],
    "Price": [350, 80, 320, 50, 200, 50, 280, 380, 60, 400]})

# Display dataframe
df
The above code will give the following output.
      Product   Category  Qty  Price
0       Mango      Fruit   12    350
1        Corn  Vegetable    5     80
2      Orange      Fruit   10    320
3     Cabbage  Vegetable    2     50
4       Mango      Fruit   10    200
5        Corn  Vegetable    3     50
6  Watermelon      Fruit    5    280
7       Apple      Fruit    8    380
8      Pumkin  Vegetable    2     60
9       Mango      Fruit   10    400
The below code will find sum of “Qty” and “Price” based on “Category” on data frame df.
df.groupby("Category").sum()
            Qty  Price
Category
Fruit        55   1930
Vegetable    12    240
The below code will find sum of “Qty” and “Price” based on different “Product” on data
frame df.
df.groupby("Product").sum()
            Qty  Price
Product
Apple         8    380
Cabbage       2     50
Corn          8    130
Mango        32    950
Orange       10    320
Pumkin        2     60
Watermelon    5    280
Now, we will perform the groupby operation on “Category” columns.
The below code will find mean of “Price” based on “Category” on data frame df.
df.groupby("Category")["Price"].mean()
Category
Fruit 321.666667
Vegetable 60.000000
Name: Price, dtype: float64
The below code will find mean of “Qty” based on “Category” on data frame df.
df.groupby("Category")["Qty"].mean()
Category
Fruit 9.166667
Vegetable 3.000000
Name: Qty, dtype: float64
The below code will find mean of “Price” based on “Product” on data frame df.
df.groupby("Product")['Price'].mean()
Product
Apple 380.000000
Cabbage 50.000000
Corn 65.000000
Mango 316.666667
Orange 320.000000
Pumkin 60.000000
Watermelon 280.000000
Name: Price, dtype: float64
The below code will find median of “Price” based on “Category” on data frame df.
df.groupby("Category")["Price"].median()
Category
Fruit        335.0
Vegetable     55.0
Name: Price, dtype: float64
The below code will find standard deviation of “Price” based on “Category” on data frame
df.
df.groupby("Category")["Price"].std()
The above code will give the following output.
Category
Fruit 73.325757
Vegetable 14.142136
Name: Price, dtype: float64
6.4 TRANSFORMATION
Transformation is a process in which we perform some group-specific computations and return a like-indexed object. A transformation is performed on a group or a column and returns an object that is indexed the same as, and has the same size as, the object being grouped. Thus, the transform should return a result that is the same size as the group chunk.
Syntax: DataFrame.transform(func, axis=0, *args, **kwargs)
Example: Here, we will perform some group-specific operations and return a like-indexed object.
The below code will create and display data frame df.
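The creation code is missing from the source; a minimal sketch, with the values read off the displayed output (NaN entries built with numpy), is:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [8, 7, 15, 12, 15],
    "B": [np.nan, 22, 32, 9, 7],
    "C": [10, 6, np.nan, 8, 14]})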
# Display dataframe
df
A B C
0 8 NaN 10.0
1 7 22.0 6.0
2 15 32.0 NaN
3 12 9.0 8.0
4 15 7.0 14.0
The below code will multiply each value in columns A, B and C of data frame df by 5.
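The call itself is not shown; a minimal sketch using transform() with a lambda function, assuming df as above:

df.transform(lambda x: x * 5)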
A B C
0 40 NaN 50.0
1 35 110.0 30.0
2 75 160.0 NaN
3 60 45.0 40.0
4 75 35.0 70.0
The below code will calculate the square root of each value in columns A, B and C of data frame df.
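A hedged sketch of the corresponding call (assuming numpy is imported as np; the source shows only the column headers of its output):

df.transform([np.sqrt])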
A B C
The below code will calculate the exponential of each value in columns A, B and C of data frame df.
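A hedged sketch of the call that produces the exponential output shown below:

df.transform([np.exp])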
              A             B             C
            exp           exp           exp
0  2.980958e+03           NaN  2.202647e+04
1  1.096633e+03  3.584913e+09  4.034288e+02
2  3.269017e+06  7.896296e+13           NaN
3  1.627548e+05  8.103084e+03  2.980958e+03
4  3.269017e+06  1.096633e+03  1.202604e+06
The below code will calculate both the square root and the exponential of each value in columns A, B and C of data frame df.
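A hedged sketch of the combined call:

df.transform([np.sqrt, np.exp])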
A B C
The below code will create and display data frame df.
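The creation code is again missing from the source; a minimal sketch matching the displayed output:

df = pd.DataFrame({
    "Team": ["MI", "CSK", "RR", "MI", "KKR", "KKR", "MI", "CSK", "KKR", "RR"],
    "Score": [210, 150, 215, 180, 185, 205, 230, 190, 160, 185]})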
# Display dataframe
df
Team Score
0 MI 210
1 CSK 150
2 RR 215
3 MI 180
4 KKR 185
5 KKR 205
6 MI 230
7 CSK 190
8 KKR 160
9 RR 185
The below code will perform a transformation operation together with the groupby() function. Here, groupby() is applied on "Team" and the team-wise sum of "Score" is computed using transform(), so every row receives its team's total.
df.groupby("Team")["Score"].transform("sum")
6.5 PIVOT TABLE
The pivot_table() function creates a spreadsheet-style pivot table as a data frame. Here,
data: dataframe object
index: a column which has the same length as data. Keys to group by on the pivot table
index.
columns: a column which has the same length as data. Keys to group by on the pivot
table column.
values: column or list of columns to aggregate
aggfunc: function to use for aggregation
The below example will create a pivot table using pivot_table() function of pandas library
of Python.
Example :
The below code will create and display data frame df.
# Importing pandas library
import pandas as pd

# Dataframe creation
df = pd.DataFrame({
    "Product": ["Mango", "Corn", "Orange", "Cabbage", "Mango", "Corn",
                "Watermelon", "Apple", "Pumkin", "Mango"],
    "Category": ["Fruit", "Vegetable", "Fruit", "Vegetable", "Fruit",
                 "Vegetable", "Fruit", "Fruit", "Vegetable", "Fruit"],
    "Qty": [12, 5, 10, 2, 10, 3, 5, 8, 2, 10],
    "Price": [350, 80, 320, 50, 200, 50, 280, 380, 60, 400]})

# Display dataframe
df
Example: To create a pivot table of the total sales of each product.
Here, we set the index to "Product" and use "sum" as the aggregate function to calculate the total sales of each product. The sum function adds up the prices of each product.
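The pivot_table() call is not shown in the source; a hedged sketch that produces the output below is:

pd.pivot_table(df, index="Product", values="Price", aggfunc="sum")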
The above code will give the following output.
Price
Product
Apple 380
Cabbage 50
Corn 130
Mango 950
Orange 320
Pumkin 60
Watermelon 280
Example:
To create a pivot table of total sales of each category.
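A hedged sketch of the call (not shown in the source):

pd.pivot_table(df, index="Category", values="Price", aggfunc="sum")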
Here, we set the index to "Category" and use "sum" as the aggregate function to calculate the total sales of each category. The above code will give the following output.
Price
Category
Fruit 1930
Vegetable 240
Example:
To create a pivot table of the total sales of each product within each category.
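A hedged sketch of the call:

pd.pivot_table(df, index=["Category", "Product"], values="Price", aggfunc="sum")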
Here, we set the index to both "Category" and "Product" and use "sum" as the aggregate function to calculate the total sales of each product. The above code will give the following output.
Price
Category Product
Fruit Apple 380
Mango 950
Orange 320
Watermelon 280
Vegetable Cabbage 50
Corn 130
Pumkin 60
Example: To create a pivot table to find the minimum, maximum, mean and median of the price for each category.
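A hedged sketch of the call (the exact column ordering of the output may differ):

pd.pivot_table(df, index="Category", aggfunc={"Price": ["min", "max", "mean", "median"]})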
Here, we set the index to "Category" and use "min", "max", "mean" and "median" as the aggregate functions to calculate the minimum, maximum, mean and median of the price for each category. The above code will give the following output.
Price
max mean median min
Category
Fruit 400.0 321.666667 335.0 200.0
Vegetable 80.0 60.000000 55.0 50.0
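A final example counts the products in each category. The original call is not shown; a hedged sketch:

pd.pivot_table(df, index=["Category", "Product"], values="Price", aggfunc=["count"])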
Here, we set the index to both "Category" and "Product" and use "count" as the aggregate function to count the number of entries for each product in each category. The above code will give the following output.
                      count
                      Price
Category   Product
Fruit      Apple          1
           Mango          3
           Orange         1
           Watermelon     1
Vegetable  Cabbage        1
           Corn           2
           Pumkin         1
6.6 CROSS TABULATIONS
The crosstab() function computes a cross tabulation (frequency table) of two or more factors. Here,
index: array-like, values to group by in the rows.
columns: array-like, values to group by in the columns.
values: array-like, optional, array of values to aggregate according to the factors.
rownames: sequence, must match number of row arrays passed, default is None
colnames: sequence, must match number of column arrays passed if passed, default is
None
aggfunc: function, optional; if no values array is passed, it computes a frequency
table.
margins: boolean, add row / column margins (i.e. subtotals), default is False.
margins_name: string, name of the row / column that will contain the subtotals if
margins is True, default is “All”.
normalize: boolean, {'all', 'index', 'columns'}, or {0, 1}, normalize by dividing all
values by the sum of values, default is False.
dropna: boolean, do not include columns whose all entries are NaN, default is True.
The below example will create a cross-table using crosstab() function of pandas library of
Python.
Example :
The below code will create and display data frame df.
# Dataframe creation
df = pd.DataFrame({
    "Name": ["Rahul", "Jyoti", "Rupal", "Rahul", "Jyoti", "Rupal",
             "Rahul", "Jyoti", "Rupal", "Rahul", "Jyoti", "Rupal",
             "Rahul", "Jyoti", "Rupal", "Rahul", "Jyoti", "Rupal"],
    "Examination": ["SEM-I", "SEM-I", "SEM-I", "SEM-I", "SEM-I", "SEM-I",
                    "SEM-I", "SEM-I", "SEM-I", "SEM-II", "SEM-II", "SEM-II",
                    "SEM-II", "SEM-II", "SEM-II", "SEM-II", "SEM-II", "SEM-II"],
    "Subject": ["Physics", "Physics", "Physics", "Chemistry", "Chemistry",
                "Chemistry", "Biology", "Biology", "Biology", "Physics",
                "Physics", "Physics", "Chemistry", "Chemistry", "Chemistry",
                "Biology", "Biology", "Biology"],
    "Result": ["PASS", "PASS", "FAIL", "PASS", "FAIL", "PASS", "FAIL",
               "PASS", "FAIL", "PASS", "PASS", "PASS", "FAIL", "PASS",
               "PASS", "PASS", "PASS", "FAIL"]})

# Display dataframe
df
When two columns are used to create a cross table, it is called a two-way cross table. Here, we will create a cross table of the two columns "Subject" and "Result". We set margins=True to display the row-wise and column-wise totals of the cross table.
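The crosstab() calls are not shown in the source; a hedged sketch for the two-way table, without and with margins:

pd.crosstab(df["Subject"], df["Result"])
pd.crosstab(df["Subject"], df["Result"], margins=True)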
When three columns are used to create a cross table, it is called a three-way cross table. Here, we will create a cross table of the three columns "Subject", "Examination" and "Result". Again, we set margins=True to display the row-wise and column-wise totals of the cross table.
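A hedged sketch for the three-way table:

pd.crosstab(df["Subject"], [df["Examination"], df["Result"]], margins=True)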
6.7 DATE AND TIME DATA TYPE
The datetime module provides the following classes:
date       It is a date object. It manipulates only the date (i.e. day, month and year).
time       It is a time object. It manipulates only the time of a day (i.e. hour, minute, second, microsecond).
datetime   It manipulates the combination of both date and time (i.e. day, month, year, hour, minute, second, microsecond).
timedelta  It manipulates a duration, expressing the difference between two dates, times or datetime values, down to microsecond resolution.
tzinfo     It is an abstract base class which provides time zone information.
timezone   It is a class that implements the tzinfo abstract base class as a fixed offset from UTC.
There are different format codes used for formatting dates and times. Some of the format codes are as follows:
%c    Local version of date and time       Tue Mar 30 13:25:30 2021
%x    Local version of date (MM/DD/YY)     11/24/2021
%X    Local version of time (HH:MM:SS)     18:25:40
To get more insight and work with the datetime module, let's take a few examples.
Example: To get the current date and time.
# To get the current date and time
import datetime as dt
d = dt.datetime.now()
print(d)
2021-05-15 00:53:08.925167
Here, the now() function is used to display the current local date and time.
Example: To get the current date.
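The code for this example is not shown; a minimal sketch:

import datetime as dt
d = dt.date.today()
print(d)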
2021-05-15
Here, the today() function is used to get the current local date.
Example: To get today's date, month and year separately.
# To get today's date, month and year separately
import datetime as dt
today = dt.date.today()
print(today)
print("Day :", today.day)
print("Month :", today.month)
print("Year :", today.year)
2021-05-15
Day : 15
Month : 5
Year : 2021
Here, the today() function is used to get the current local date and display day, month and
year separately.
Example: To represent a date using date object.
# To represent a date using date object
import datetime as dt
d = dt.date(2021, 1, 26)
print(d)
The above code will give the following output.
2021-01-26
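The next output comes from a fromtimestamp() example whose code is not shown; a minimal sketch (the timestamp value 987654321 is an assumption chosen to reproduce the printed date):

import datetime as dt
# Convert seconds since the Unix epoch into a date (time zone dependent)
d = dt.date.fromtimestamp(987654321)
print(d)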
2001-04-19
Here, the fromtimestamp() function is used to convert a timestamp in seconds into the equivalent date.
Example: To represent a time using time object.
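The code for the time examples below is not shown; a minimal sketch that reproduces the printed values:

import datetime as dt

a = dt.time()                                # Time : 00:00:00
b = dt.time(10, 40, 55)                      # Time : 10:40:55
c = dt.time(hour=10, minute=40, second=55)   # Time : 10:40:55
d = dt.time(10, 40, 55, 123456)              # Time : 10:40:55.123456
print("Time :", a)
print("Time :", b)
print("Time :", c)
print("Time :", d)
print("Hour :", d.hour)
print("Minute :", d.minute)
print("Second :", d.second)
print("Microsecond :", d.microsecond)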
Time : 00:00:00
Time : 10:40:55
Time : 10:40:55
Time : 10:40:55.123456
Hour : 10
Minute : 40
Second : 55
Microsecond : 123456
Here, the time() function with different arguments is used to get the time in different formats.
Example: To represent a datetime object.
import datetime as dt
d1 = dt.datetime(2021, 5, 15)                       # date only
print(d1)
# datetime(year, month, day, hour, minute, second, microsecond)
d2 = dt.datetime(2021, 5, 15, 16, 35, 25, 234561)
print(d2)
The above code will give the following output.
2021-05-15 00:00:00
2021-05-15 16:35:25.234561
Here, the datetime() constructor is used to display dates in different formats.
Example: To represent a datetime object using different format.
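The code for this example is not shown; a minimal sketch that reproduces the values below (the timestamp depends on the local time zone):

import datetime as dt

d = dt.datetime(2021, 5, 15, 16, 35, 25, 234561)
print("Year :", d.year)
print("Month :", d.month)
print("Day :", d.day)
print("Hour :", d.hour)
print("Minute :", d.minute)
print("Timestamp :", d.timestamp())   # seconds since the epoch, time zone dependent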
Year : 2021
Month : 5
Day : 15
Hour : 16
Minute : 35
Timestamp : 1621076725.234561
Here, the datetime() function is used to display the date separately as year, month, day, hour, minute, etc.
Example: To find the difference between two dates and times.
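The code is not shown in the source; a minimal sketch with assumed dates:

import datetime as dt

d1 = dt.datetime(2021, 5, 15, 16, 35, 25)
d2 = dt.datetime(2021, 1, 26, 10, 20, 15)
diff = d1 - d2          # subtraction yields a timedelta object
print(diff)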
Here, the datetime() function is used to create the dates, and the subtraction operation between the two different dates yields a timedelta object.
Example: To use the timedelta object.
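A minimal sketch of a timedelta example with assumed values:

import datetime as dt

td = dt.timedelta(days=2, hours=3, minutes=30)
print(td)
print(td.total_seconds())   # the whole duration expressed in seconds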
Here, the total_seconds() function is used to convert the given duration into seconds.
Example: To use strftime() function for formatting.
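A minimal sketch of strftime() formatting with assumed format codes:

import datetime as dt

now = dt.datetime.now()
print(now.strftime("%d/%m/%Y"))          # e.g. 15/05/2021
print(now.strftime("%A, %d %B %Y"))      # e.g. Saturday, 15 May 2021
print(now.strftime("%H:%M:%S"))          # e.g. 16:35:25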
Here, the strftime() function is used to display the dates in different format.
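A minimal sketch of strptime(), with an assumed date string and format:

import datetime as dt

date_string = "15 May, 2021"
date_object = dt.datetime.strptime(date_string, "%d %B, %Y")
print(date_object)    # 2021-05-15 00:00:00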
Here, the strptime() function is used to convert a date in string format into a datetime object.
Example: To use different timezone.
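A minimal sketch using the pytz package with assumed time zones:

import datetime as dt
import pytz

local = dt.datetime.now()
print("Local :", local.strftime("%d/%m/%Y %H:%M:%S"))

ny = dt.datetime.now(pytz.timezone("America/New_York"))
print("New York :", ny.strftime("%d/%m/%Y %H:%M:%S"))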
Here, pytz is used to set different time zones and the strftime() function is used to display the dates in different formats.
6.8 SUMMARY
In this module the students learn about data aggregation and group wise operations and will be able to perform various data science related operations using Python:
Ability to perform data aggregation using various functions such as min, max, sum, average, count, etc.
Ability to perform group wise operations on a specific group, such as splitting, applying and combining, to calculate the mean, median and standard deviation group wise.
Ability to build a statistical summary table using pivoting.
Ability to work with cross tabulation using both two-way and three-way cross tables.
Ability to perform various operations on date and time data types.
REFERENCES
Books
5. Davy Cielen, Arno D. B. Meysman, Mohamed Ali : Introducing Data Science, Manning
Publications Co.
6. Stephen Klosterman (2019) : Data Science Projects with Python, Packt Publishing
7. Jake VanderPlas (2017) : Python Data Science Handbook: Essential Tools for
Working with Data, O'Reilly
8. Wes McKinney and Pandas Development Team (2021) : pandas : powerful Python data
analysis toolkit, Release 1.2.3, 2021
Web References
6. https://www.geeksforgeeks.org
7. https://www.tutorialspoint.com
8. https://www.w3schools.com
9. https://pandas.pydata.org
10. https://pbpython.com
QUESTIONS
Short Answer:
Long Answer:
PRACTICALS
B.Sc.(DATA SCIENCE)
SEMESTER-I
7.0 Objectives
7.1 Introduction
7.2 Generative Modeling
7.3 Predictive Modeling
7.3.1 Models
7.3.2 Predictive algorithms
7.4 Charts
7.4.1 Histogram
7.4.2 Scatter Plot
7.4.3 Line Chart
7.4.4 Bar Chart
7.5 Graph
7.6 3-D Visualization and Presentation
7.6.1 3D Line Plot
7.6.2 3D Scatter Plot
7.6.3 3D Bar Plot
7.6.4 Wire Plot
7.6.5 Surface Plot
7.7 Summary
7.0 OBJECTIVES
The main goal of this module is to help students learn, understand and practice data science approaches, including the study of the latest data science tools and programming languages. The main topics of this module are data modeling, which includes the basics of generative modeling and predictive modeling techniques, and data visualization, which includes different types of charts and plots like histograms, scatter plots, time series plots, etc.
7.1 INTRODUCTION
Data science has become a buzzword that everyone talks about. Data science is an interdisciplinary field that combines domain expertise, computer programming skills, and mathematical and statistical knowledge to extract meaningful or unknown patterns from unstructured and structured datasets.
Data science is useful for the extraction, preparation, analysis and visualization of information. Various scientific methods can be applied to gain insight into data.
Data science is all about using data to solve problems. Data has become the fuel of industries. It is among the most in-demand fields of the 21st century. Every industry requires data for functioning, searching, marketing, growing and expanding its business.
The application areas of data science include health care, fraud detection, disease prediction, real-time shipping routes, speech recognition, targeted advertising, gaming and many more.
7.2 INTRODUCTION TO GENERATIVE MODELING
Generative models are a family of machine learning models that describe how data is generated. Machine learning and deep learning problems broadly fall into two types: supervised learning and unsupervised learning.
In a supervised learning problem, we have two kinds of variables: independent variables (x) and the target variable (y). Examples of supervised learning are classification, regression, object detection, etc.
In an unsupervised learning problem, we have only independent variables (x); there is no target variable or label. The aim is to find underlying patterns in the dataset. Examples of unsupervised learning are clustering, dimensionality reduction, etc.
Generative modelling is an unsupervised learning problem in machine learning. The model automatically discovers and learns the rules, regularities or patterns from a large input training dataset, and learns to create data that looks like the data it was given. A generative model can be broadly defined as follows:
A generative model describes how a dataset is generated, in terms of a probabilistic model. By sampling from this model, we are able to generate new data.
A generative model can generate new data instances. If we have a dataset containing images of an animal, we may develop a model which can generate a new image of the same animal that never existed but still looks like a real animal. The model has learned the general rules that govern the appearance of that specific animal. Generative modelling is used to solve this kind of problem. The generative modelling process is as follows:
We require a dataset that consists of many instances of the entity which we want to generate. This dataset is known as the training data and each data point is called an observation. The following diagram represents a horse animal dataset.
Classification Model
This is the simplest and most commonly used predictive analytics model. It works on categorical information based on historical data.
This model is applied in many industrial applications because it can easily be retrained with new data as per the needs.
Clustering Model
This model is used to take the data and divide it into different nested smart groups based on some common attributes. It helps to group things or data with shared characteristics or behaviours and to take strategic decisions for each group. For example, customers can be divided based on common attributes like purchasing methods, purchasing power, etc. for a targeted marketing campaign.
Forecast Model
This is a very popular and widely used model. It works with metric value prediction, estimating the value of new data based on learnings from historical data. It is also used to generate numerical values and to fill in values where none are found. This model can be applied wherever historical numerical data is available, and it can consider multiple input parameters.
This model is used in many different businesses and industries. For example, a company's customer care department can predict how many support calls it will receive per day.
Time Series Model
This model focuses on data where time is an input parameter. It is applied by using data points taken from previous years' data to develop a numerical metric that is used to predict trends within a specified period of time.
This model is used in many industries that want to see how a particular variable changes over a time period. It also takes care of extraneous factors that might affect the variable, such as seasons or seasonal variables. For example, shopping mall owners may want to know how many customers will visit the mall in a week or a month.
Outliers Model
This model works with anomalous data entries in a dataset. It works by finding unusual data, either in isolation or in relation to different categories and numbers. It is most useful in industries where identifying anomalies can save organizations crores of rupees, such as finance and retail. It is especially effective in fraud detection because it can find anomalies. Since an incidence of fraud is a deviation from the norm, this model is likely to predict it before it occurs. For example, when identifying a fraudulent transaction, this model can assess the amount of money involved, purchase history, time, location, etc.
7.3.2 Predictive Algorithms
Predictive analytics algorithms can be separated into two groups: machine learning and deep learning. Both are subsets of artificial intelligence (AI).
Machine Learning: It deals with structured data such as tables or spreadsheets. It has both linear and non-linear algorithms. Linear algorithms train quickly, while non-linear algorithms are better optimized for the problems they face.
Deep Learning: It is a subset of machine learning. It deals with unstructured data such as images, social media posts, text, audio and video.
Several algorithms can be used for machine learning predictive modelling. The most common algorithms are:
Random Forest
Generalized Linear Model (GLM) for Two Values
Gradient Booster Model (GBM)
K-Means
Prophet
7.4 CHARTS
A chart is a representation of data in a graphical format. It helps in summarizing and presenting a large amount of data in a simple and easily understandable format. By placing the data in a visual context, we can easily detect patterns, trends and correlations among them.
Python provides multiple easy-to-use graphics libraries for data visualization with different features. These libraries work with both small and large datasets. Some of the most popular and commonly used Python data visualization libraries are:
Matplotlib
Pandas
Seaborn
ggplot
Plotly
Matplotlib is the most popular multi-platform data visualization library available in Python. It provides a wide variety of plots such as histograms, scatter plots, line or time series plots, bar charts, etc.
7.4.1 Histogram
• bins: number of bins in the plot
• range: lower and upper range of the bins (optional)
• density: density or count to populate the plot
• histtype: type of histogram plot such as bar, step and stepfilled, default is bar
• align: controls where the bars are plotted, such as left, mid and right
• rwidth: relative width of a bar as a fraction of the bin width
• color: a color spec or sequence of color specs
• orientation: horizontal or vertical representation, default is vertical
Example: Here we take an example of the ages of people. The x-axis represents the age groups (bins) and the y-axis represents the number of people in each bin.
# Importing library
import matplotlib.pyplot as plt

# Data values
age = [22, 55, 62, 45, 21, 22, 4, 12, 14, 64, 58, 68, 95, 85, 55, 38, 18, 37, 65, 59, 11,
       15, 80, 75, 65]
bins = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]

# Plot histogram
plt.hist(age, bins=bins)

# Show plot
plt.show()
age = [22, 55, 62, 45, 21, 22, 34, 42, 42, 4, 2, 102, 95, 85, 55, 110, 120, 7, 65, 55,
       111, 115, 80, 75, 65, 4, 44, 43, 42, 48]
bins = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
plt.hist(age, bins=bins, histtype="step", color="red")
plt.show()
Here, the type of histogram is set to step and the colour is set to red.
We can also create multiple histograms for different columns as follows. Here, we create a data frame with three columns: "Age", "Height" and "Weight".
import pandas as pd
df = pd.DataFrame({
"Age":[25,38,45,29,65,52,46,72,28,35],
"Height": [145,138,160,180,165,170,158,162,171,168],
"Weight": [75,90,85,72,68,76,82,96,63,79]})
hist = df.hist(bins=10)
7.4.2 Scatter Plot
# Importing library
import matplotlib.pyplot as plt

# Data values
Boys_Weight = [67, 89, 72, 114, 65, 80, 91, 106, 60, 59]
Girls_Weight = [50, 59, 63, 40, 92, 88, 64, 45, 52, 59]

# Scatter plot
plt.scatter(Boys_Weight, Girls_Weight)

# Show plot
plt.show()
In the above plot, we can see the relationship between the boys' and girls' weights.
We can also compare the boys' and girls' weights of one class with those of another class.
Boys_Weight = [67, 89, 72, 114, 65, 80, 91, 106, 60, 59]
Girls_Weight = [50, 59, 63, 40, 92, 88, 64, 45, 52, 68]
plt.scatter(Boys_Weight, Girls_Weight)

Boys_Weight = [54, 67, 92, 56, 83, 65, 89, 78, 50, 49]
Girls_Weight = [41, 79, 56, 74, 76, 73, 74, 87, 82, 63]
plt.scatter(Boys_Weight, Girls_Weight)

plt.title("Scatter Plot")
plt.xlabel("Boys Weight")
plt.ylabel("Girls Weight")
plt.show()
Boys_Weight = [67, 89, 72, 114, 65, 80, 91, 106, 60, 59]
Girls_Weight = [50, 59, 63, 40, 92, 88, 64, 45, 52, 68]
plt.scatter(Boys_Weight, Girls_Weight, color="red")
Boys_Weight = [54, 67, 92, 56, 83, 65, 89, 78, 50, 49]
Girls_Weight = [41, 79, 56, 74, 76, 73, 74, 87, 82, 63]
plt.scatter(Boys_Weight, Girls_Weight, color="Green")
plt.title("Scatter Plot")
plt.xlabel("Boys_Weight")
plt.ylabel("Girls_Weight")
plt.show()
Boys_Weight = [67, 90, 72, 114, 67, 80, 91, 106, 60, 59]
Girls_Weight = [50, 59, 63, 40, 52, 88, 60, 45, 52, 68]
size = [200, 220, 240, 260, 280, 300, 320, 340, 360, 380]
plt.scatter(Boys_Weight, Girls_Weight, s=size, alpha=0.5)
plt.title("Scatter Plot")
plt.xlabel("Boys_Weight")
plt.ylabel("Girls_Weight")
plt.show()
Boys_Weight = [67, 89, 72, 114, 65, 80, 91, 106, 60, 59]
Girls_Weight = [50, 59, 63, 40, 92, 88, 64, 45, 52, 68]
plt.scatter(Boys_Weight, Girls_Weight, marker="s", edgecolor="green", s=50)

Boys_Weight = [54, 67, 92, 56, 83, 65, 89, 78, 50, 49]
Girls_Weight = [41, 79, 56, 74, 76, 73, 74, 87, 82, 63]
plt.scatter(Boys_Weight, Girls_Weight, marker="^", edgecolor="red", s=100)
plt.title("Scatter Plot")
plt.xlabel("Boys Weight")
plt.ylabel("Girls Weight")
plt.show()
7.4.3 Line Chart
# Importing library
from matplotlib import pyplot as plt

# Data values
Year = [2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020]
Student = [88, 76, 61, 68, 92, 85, 62, 58, 75, 83]

# Line plot
plt.plot(Year, Student)

# Show plot
plt.show()
Year = [2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020]
Student = [88, 76, 61, 68, 92, 85, 62, 58, 75, 83]
plt.title("Line Chart")
plt.xlabel("Year")
plt.ylabel("No. of Students")
plt.plot(Year, Student)
plt.show()
We can also set the line colour using color as an argument. Here we set the line colour to red.
Year = [2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020]
Student = [88, 76, 61, 68, 92, 85, 62, 58, 75, 83]
plt.title("Line Chart")
plt.xlabel("Year")
plt.ylabel("No. of Students")
plt.plot(Year, Student, color="red")
plt.show()
We can also set the line width using linewidth or lw as an argument. Here we set the linewidth to 10.
Year = [2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020]
Student = [88, 76, 61, 68, 92, 85, 62, 58, 75, 83]
plt.title("Line Chart")
plt.xlabel("Year")
plt.ylabel("No. of Students")
plt.plot(Year, Student, linewidth=10)
plt.show()
Year = [2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020]
Student = [88, 76, 61, 68, 92, 85, 62, 58, 75, 83]
plt.title("Line Chart")
plt.xlabel("Year")
plt.ylabel("No. of Students")
plt.plot(Year, Student,'*')
plt.show()
We can plot multiple lines in a single line chart by calling plt.plot() once for each data series. Here the two series are plotted separately, and a dotted linestyle is set for the first one (see the sketch after the code fragment below).
plt.plot(Y)
plt.show()
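The data series and the first plt.plot() call are missing from the source; the following is a minimal sketch with assumed illustrative values:

import matplotlib.pyplot as plt

# Assumed data series (not from the original text)
X = [88, 76, 61, 68, 92, 85, 62, 58, 75, 83]
Y = [60, 55, 63, 71, 88, 75, 91, 82, 85, 80]

plt.title("Line Chart")
plt.plot(X, linestyle="dotted")   # first line drawn with a dotted style
plt.plot(Y)                       # second line with the default style
plt.show()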
7.4.4 Bar Chart
# Importing library
from matplotlib import pyplot as plt
import numpy as np
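The data lists and the first plt.bar() call are not shown in the source; a minimal sketch with assumed Name and Age values (these names are reused by the variations that follow):

# Assumed illustrative data
Name = ["Rahul", "Shreya", "Pankaj", "Monika", "Kalpesh"]
Age = [25, 38, 45, 29, 65]

plt.title("Bar Chart")
plt.xlabel("Name")
plt.ylabel("Age")
plt.bar(Name, Age)
plt.show()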
We can also change the bar colour. We pass color as an argument to change the colour of the bars. Here we set the bar colour to red.
plt.title("Bar Chart")
plt.xlabel("Name")
plt.ylabel("Age")
plt.bar(Name, Age, color="red")
plt.show()
We can also set the bar width. We pass width as an argument to set the width of the bars. Here we set the bar width to 0.2.
plt.title("Bar Chart")
plt.xlabel("Name")
plt.ylabel("Age")
plt.bar(Name, Age, width=0.2)
plt.show()
We can also display the bars horizontally instead of vertically, using the barh() function.
plt.title("Bar Chart")
plt.xlabel("Age")
plt.ylabel("Name")
plt.barh(Name, Age)
plt.show()
df = pd.DataFrame({
"Name":["Rahul","Shreya","Pankaj","Monika","Kalpesh"],
"Maths":[67,83,74,91,56],
"Biology":[90,78,86,75,68],
"English":[60,55,63,71,88]})
df.plot.bar()
7.5 GRAPH
A graph is a pictorial representation of a set of objects, where some pairs of objects are connected by links. The interconnected objects are represented by points called vertices, and the links that connect the vertices are called edges. We can perform some operations on a graph, such as:
Display vertices
Display edges
Add new vertex
Add new edge
Create graph
The dictionary data type can be used to represent a graph in Python. The vertices of the graph are represented as the keys of the dictionary, and the links between the vertices (the edges) are represented as the values of the dictionary.
Take the following graph as an example.
The above graph consists the following vertices (V) and edges (E).
V = {A, B, C, D, E}
E = {AB, AC, BC, BD, CE, DE}
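The code that builds the dictionary is not shown in the source; a minimal sketch that produces the printed result below is:

# Adjacency-list representation of the graph as a dictionary
graph = {
    'a': ['b', 'c'],
    'b': ['a', 'c', 'd'],
    'c': ['a', 'b', 'e'],
    'd': ['b', 'e'],
    'e': ['c', 'd']}
print(graph)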
{'a': ['b', 'c'], 'b': ['a', 'c', 'd'], 'c': ['a', 'b', 'e'], 'd': ['b', 'e'], 'e': ['c', 'd']}
7.6 3-D VISUALIZATION AND PRESENTATION
# Importing library
import matplotlib.pyplot as plt

# 3D projection plot
fig = plt.figure()
ax = plt.axes(projection="3d")
plt.title("3D Projection", color="blue")
plt.show()
This is the most basic three-dimensional plot, created from a set of (x, y, z) triples; it is often used to plot time series data in three dimensions. This plot is drawn using the ax.plot3D function as follows:
# 3D projection
fig = plt.figure()
ax = plt.axes(projection="3d")
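The data and the ax.plot3D call are not shown in the source; a minimal sketch with assumed data is:

import numpy as np
import matplotlib.pyplot as plt

fig = plt.figure()
ax = plt.axes(projection="3d")

# Assumed data: a spiral defined by (x, y, z) triples
z = np.linspace(0, 15, 200)
x = np.sin(z)
y = np.cos(z)

ax.plot3D(x, y, z, color="red")
ax.set_title("3D Line Plot", color="blue")
plt.show()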
This is a basic three-dimensional plot created from a set of (x, y, z) triples. It represents the data points on three axes to show the relationship between three variables. This plot is drawn using the ax.scatter3D function as follows:
# Importing library
from mpl_toolkits import mplot3d
import numpy as np
import matplotlib.pyplot as plt

# 3D projection
fig = plt.figure()
ax = plt.axes(projection="3d")

# x, y, z: data arrays (their definitions are not shown in the source)
# 3D scatter plotting
ax.scatter3D(x, y, z, color="red")
ax.set_title("3D Scatter Plot", color="blue")
plt.show()
# Importing library
from mpl_toolkits.mplot3d import axes3d
import matplotlib.pyplot as plt
import numpy as np
from matplotlib import style

# 3D projection
fig = plt.figure()
ax = fig.add_subplot(111, projection="3d")

# Bar sizes; the x, y, z positions and the ax.bar3d call are not shown in the source
dx = np.ones(10)
dy = np.ones(10)
dz = [1, 3, 5, 7, 9, 11, 7, 5, 3, 7]
# 3D projection
fig = plt.figure()
ax = plt.axes(projection="3d")

# Function
def func(x, y):
    return np.sin(np.sqrt(x * x + y * y))

# Grid of (X, Y) points and corresponding Z values (assumed range,
# since the original grid definition is not shown)
x = np.linspace(-5, 5, 50)
y = np.linspace(-5, 5, 50)
X, Y = np.meshgrid(x, y)
Z = func(X, Y)

# 3D wireframe plotting
ax.plot_wireframe(X, Y, Z, color="Green")
ax.set_title("3D Wire Plot", color="blue")
plt.show()
# 3D projection
fig = plt.figure()
ax = plt.axes(projection="3d")

# Function
def func(x, y):
    return np.sin(np.sqrt(x * x + y * y))

# Grid of (X, Y) points and corresponding Z values (assumed range,
# since the original grid definition is not shown)
x = np.linspace(-5, 5, 50)
y = np.linspace(-5, 5, 50)
X, Y = np.meshgrid(x, y)
Z = func(X, Y)

# 3D surface plotting
ax.plot_surface(X, Y, Z, rstride=1, cstride=1, cmap='viridis', edgecolor='none')
ax.set_title("3D Surface Plot", color="blue")
plt.show()
REFERENCES
Books
11. Davy Cielen, Arno D. B. Meysman, Mohamed Ali: Introducing Data Science, Manning
Publications Co.
12. Stephen Klosterman (2019) : Data Science Projects with Python, Packt Publishing
13. Jake VanderPlas (2017) : Python Data Science Handbook: Essential Tools for
Working with Data, O'Reilly
14. Wes McKinney and Pandas Development Team (2021) : pandas : powerful Python
data analysis toolkit, Release 1.2.3
Web References
1. https://www.oreilly.com
2. https://www.geeksforgeeks.org
3. https://www.tutorialspoint.com
4. https://www.w3schools.com
5. https://pandas.pydata.org
6. https://pbpython.com
7. https://jakevdp.github.io
8. https://matplotlib.org
9. https://www.logianalytics.com
10. https://seleritysas.com
QUESTIONS
Short Answer:
Long Answer:
PRACTICALS