Python for Data Science (3150713)
Chapter 5: Data Wrangling
Q-1. Write a short note on Data Wrangling.
Data Wrangling is the process of cleaning, transforming, and
organizing raw data into a structured and usable format for analysis.
This process is crucial because real-world data often comes in
inconsistent, incomplete, or unstructured forms, such as missing
values, duplicate entries, or incorrect formats.
Key steps in data wrangling include:
1. Data Cleaning: Identifying and fixing or removing inaccurate,
missing, or inconsistent data.
2. Data Transformation: Converting data into the desired format,
including normalization, aggregation, or restructuring.
3. Data Integration: Combining data from different sources into a
single, cohesive dataset.
4. Data Enrichment: Adding relevant external data to enhance the
original dataset.
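A minimal pandas sketch of these steps, assuming two small hypothetical DataFrames (all column names and values here are illustrative):
import pandas as pd

# Hypothetical raw data with a missing value and a duplicate row
sales = pd.DataFrame({
    "customer_id": [1, 2, 2, 3],
    "amount": [250.0, None, None, 400.0]
})
regions = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "region": ["East", "West", "North"]
})

# 1. Cleaning: drop the duplicate row, fill the missing amount with the median
sales = sales.drop_duplicates()
sales["amount"] = sales["amount"].fillna(sales["amount"].median())

# 2. Transformation: aggregate the total amount per customer
totals = sales.groupby("customer_id", as_index=False)["amount"].sum()

# 3. Integration: combine with region data from another source
merged = totals.merge(regions, on="customer_id")
print(merged)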
Q-2. Write a short note on Exploratory Data Analysis (EDA).
Exploratory Data Analysis (EDA) is a crucial initial step in data science
projects. It refers to the method of studying and exploring datasets to
understand their key characteristics, uncover patterns, locate outliers,
and identify relationships between variables. EDA is normally carried out
as a preliminary step before undertaking more formal statistical analyses
or modeling.
Key aspects of EDA include:
• Distribution of Data: Examining the distribution of data points
to understand their range, central tendencies (mean, median), and
dispersion (variance, standard deviation).
• Graphical Representations: Utilizing charts such as histograms,
box plots, scatter plots, and bar charts to visualize relationships
within the data and distributions of variables.
• Outlier Detection: Identifying unusual values that deviate from
other data points. Outliers can influence statistical analyses and
might indicate data entry errors or unique cases.
• Correlation Analysis: Checking the relationships between
variables to understand how they might affect each other. This
includes computing correlation coefficients and creating
correlation matrices.
• Handling Missing Values: Detecting and deciding how to
address missing data points, whether by imputation or removal,
depending on their impact and the amount of missing data.
• Summary Statistics: Calculating key statistics that provide
insight into data trends and nuances.
• Testing Assumptions: Many statistical tests and models assume
the data meet certain conditions (like normality or
homoscedasticity). EDA helps verify these assumptions.
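A minimal EDA sketch covering several of these aspects, assuming a small hypothetical DataFrame df with numeric columns:
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical dataset
df = pd.DataFrame({
    "height": [5.6, 6.1, 5.8, 5.9, 6.3, 5.5],
    "weight": [60, 75, 68, 70, 80, 58]
})

print(df.describe())      # summary statistics: mean, std, quartiles
print(df.isnull().sum())  # missing values per column
print(df.corr())          # correlation matrix

df.hist()                                 # distribution of each variable
df.plot.scatter(x="height", y="weight")   # relationship between variables
plt.show()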
Q-3. Differentiate Numerical Data and Categorical Data
with suitable example. Also explain how to handle such data
types.
Numerical Data and Categorical Data are two fundamental types of
data, each requiring different approaches for analysis and handling.
1. Numerical Data:
• Definition: Data that consists of numbers and can be quantified.
• Types:
o Discrete: Data with countable values (e.g., integers), such as
the number of cars in a parking lot (e.g., 10, 15, 20).
o Continuous: Data that can take any value within a range (e.g.,
floating-point numbers), such as the height of students (e.g.,
5.6 feet, 6.1 feet).
• Handling:
o Scaling: Normalize or standardize values (e.g., min-max
scaling or z-score standardization) so that features are on
comparable ranges.
o Missing Values: Impute with the mean or median, or remove
rows, depending on the amount of missing data.
o Outliers: Detect extreme values (e.g., with box plots or
z-scores) and decide whether to cap, transform, or remove
them.
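A minimal sketch of these handling steps with scikit-learn (the feature values are hypothetical):
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Hypothetical numerical feature with a missing value
heights = np.array([[5.6], [6.1], [np.nan], [5.9]])

# Impute the missing value with the mean
heights = SimpleImputer(strategy="mean").fit_transform(heights)

# Standardize to zero mean and unit variance
heights_scaled = StandardScaler().fit_transform(heights)
print(heights_scaled)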
2. Categorical Data:
• Definition: Data that represents categories or groups and is
non-numeric.
• Types:
o Nominal: Categories with no intrinsic order (e.g., colors,
gender), such as car brands (e.g., Toyota, Honda, Ford).
o Ordinal: Categories with a meaningful order but no fixed
interval between them, such as satisfaction levels (e.g.,
Poor, Fair, Good, Excellent).
• Handling:
o Encoding:
▪ For nominal data, use One-Hot Encoding to convert
categories into binary columns.
▪ For ordinal data, use Label Encoding or manual
mapping to assign meaningful numerical values to
ordered categories.
o Missing Values: Use modes (most frequent values) for
imputation or treat them as a separate category.
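A minimal sketch of both encodings using pandas (the column names and categories are hypothetical):
import pandas as pd

df = pd.DataFrame({
    "brand": ["Toyota", "Honda", "Ford"],          # nominal
    "satisfaction": ["Poor", "Good", "Excellent"]  # ordinal
})

# Nominal: one-hot encoding creates one binary column per category
df = pd.get_dummies(df, columns=["brand"])

# Ordinal: manual mapping preserves the order
order = {"Poor": 0, "Fair": 1, "Good": 2, "Excellent": 3}
df["satisfaction"] = df["satisfaction"].map(order)
print(df)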
Q-4. Write a short note on Classes in the Scikit-learn library.
Classes in Scikit-learn
• Understanding how classes work is an important prerequisite for
being able to use the Scikit-learn package appropriately.
• Scikit-learn is the package for machine learning and data science
experimentation favored by most data scientists.
• It contains a wide range of well-established learning algorithms,
error functions, and testing procedures.
• Install: conda install scikit-learn
• There are four class types covering all the basic machine-learning
functionalities:
o Classifying
o Regressing
o Grouping by clusters
o Transforming data
• Even though each base class has specific methods and attributes,
the core functionalities for data processing and machine learning
are guaranteed by one or more series of methods and attributes
called interfaces.
• The interfaces provide a uniform Application Programming
Interface (API) to enforce similarity of methods and attributes
between all the different algorithms present in the package. There
are four Scikit-learn object-based interfaces:
o estimator: For fitting parameters, learning them from data
according to the algorithm.
o predictor: For generating predictions from the fitted
parameters.
o transformer: For transforming data, implementing the
fitted parameters.
o model: For reporting goodness of fit or other score
measures.
• The package groups the algorithms built on base classes and one
or more object interfaces into modules, each module displaying a
specialization in a particular type of machine-learning solution.
For example, the "linear_model" module is for linear modeling,
and metrics is for score and loss measures.
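A minimal sketch showing the four interfaces on two standard Scikit-learn classes (the toy data is hypothetical):
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

X = [[1], [2], [3], [4]]
y = [2, 4, 6, 8]

model = LinearRegression()
model.fit(X, y)              # estimator: learn parameters from data
print(model.predict([[5]]))  # predictor: generate predictions
print(model.score(X, y))     # model: report goodness of fit (R^2)

scaler = StandardScaler()
print(scaler.fit_transform(X))  # transformer: transform data using fitted parameters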
Q-5. Differentiate Supervised and Unsupervised Learning.
Feature | Supervised Learning | Unsupervised Learning
------- | ------------------- | ---------------------
Input data | Uses known, labeled data as input | Uses unlabeled data as input
Computational complexity | Less computational complexity | More computational complexity
Real-time | Uses offline analysis of data | Uses real-time analysis of data
Number of classes | The number of classes is known | The number of classes is not known
Accuracy of results | Accurate and reliable results | Moderately accurate and reliable results
Output data | The desired output is given | The desired output is not given
Model | It is not possible to learn larger and more complex models than in unsupervised learning | It is possible to learn larger and more complex models than in supervised learning
Training data | Training data is used to infer the model | Labeled training data is not used
Another name | Also called classification | Also called clustering
Categorization | Can be categorized into classification and regression problems | Can be categorized into clustering and association problems
Test of model | We can test our model | We cannot test our model
Example | Optical character recognition | Finding a face in an image
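To make the contrast concrete, a minimal sketch on the same toy data (the points are hypothetical; labels are used only in the supervised case):
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

X = [[1, 1], [1, 2], [8, 8], [9, 8]]
y = [0, 0, 1, 1]  # labels, available only in the supervised setting

# Supervised: learn from labeled data, then predict the known classes
clf = LogisticRegression().fit(X, y)
print(clf.predict([[2, 1], [8, 9]]))

# Unsupervised: group unlabeled data; the meaning of the groups is not given
km = KMeans(n_clusters=2, n_init=10).fit(X)
print(km.labels_)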
Q-6. Explain the Hashing Trick in Python with an example.
Most machine learning algorithms require numerical inputs. If your data
consists of text or categories, it must first be converted into numeric
values. One way to do this is the hashing trick, which applies a hash
function to each feature (e.g., a word) to map it directly to a column
index, without storing an explicit vocabulary.
Example
• Data Sample:
Id | Salary | Gender | Gender_Num
-- | ------ | ------ | ----------
1 | 10000 | Male | 0
2 | 15000 | Female | 1
3 | 12000 | Female | 1
4 | 13000 | Male | 0
5 | 14000 | Male | 0
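The Gender_Num column above is a simple label encoding; a minimal pandas sketch that reproduces it (the DataFrame mirrors the sample):
import pandas as pd

df = pd.DataFrame({
    "Id": [1, 2, 3, 4, 5],
    "Salary": [10000, 15000, 12000, 13000, 14000],
    "Gender": ["Male", "Female", "Female", "Male", "Male"]
})

# Map each category to a number, as in the Gender_Num column
df["Gender_Num"] = df["Gender"].map({"Male": 0, "Female": 1})
print(df)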
When dealing with text, Scikit-learn offers several ways to build such
numeric representations. The example below uses CountVectorizer, which
learns an explicit vocabulary and counts word occurrences; the hashing
trick proper is implemented by HashingVectorizer, shown after the output.
Example Text
• "Prime is the best engineering college in Navsari."
• "Navsari is famous for engineering."
• "College is located in Mangrol."
Numerical Representation
The learned vocabulary, in alphabetical order, is: best, college,
engineering, famous, for, in, is, located, mangrol, navsari, prime, the.
Each sentence becomes a vector of word counts over these 12 columns:
Sentence | best | college | engineering | famous | for | in | is | located | mangrol | navsari | prime | the
1 | 1 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 1 | 1
2 | 0 | 0 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0
3 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 0
Code Example:
from sklearn.feature_extraction.text import CountVectorizer

# Sample text
text = [
    "Prime is the best engineering college in Navsari",
    "Navsari is famous for engineering",
    "College is located in Mangrol"
]

# Create a CountVectorizer and learn the vocabulary
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(text)

# Output the document-term count matrix
print(X.toarray())
Output:
[[1 1 1 0 0 1 1 0 0 1 1 1]
[0 0 1 1 1 0 1 0 0 1 0 0]
[0 1 0 0 0 1 1 1 1 0 0 0]]
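For contrast, a minimal sketch of the hashing trick itself using Scikit-learn's HashingVectorizer (n_features=16 is kept small here purely for readability; real applications use a much larger value):
from sklearn.feature_extraction.text import HashingVectorizer

text = [
    "Prime is the best engineering college in Navsari",
    "Navsari is famous for engineering",
    "College is located in Mangrol"
]

# Each word is hashed to one of n_features columns; no vocabulary is stored
hasher = HashingVectorizer(n_features=16, norm=None, alternate_sign=False)
X = hasher.transform(text)
print(X.toarray())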
Q-7. Explain timeit magic command in Jupyter Notebook
with example.
We can find the time taken to execute a statement or a cell in a Jupyter
Notebook with the help of the timeit magic command.
This command can be used both as a line magic and a cell magic:
• In line mode you can time a single-line statement (multiple
statements can be chained using semicolons).
Syntax: %timeit [-n<N> -r<R> [-t|-c] -q -p<P> -o] statement
• In cell mode, the statement on the first line is used as setup code
(executed but not timed) and the body of the cell is timed. The
cell body has access to any variables created in the setup code.
Syntax: %%timeit [-n<N> -r<R> [-t|-c] -q -p<P> -o] setup_code
Here, the -n flag gives the number of loops per run and the -r flag
gives the number of runs (repeats). A line-mode example follows the
cell example below.
• Example:
%%timeit -n 1000 -r 7
listPrime = []
for i in range(2, 1000):
    flag = 0
    for j in range(2, i):
        if i % j == 0:
            flag = 1
            break
    if flag == 0:
        listPrime.append(i)
Output:
14.7 ms ± 953 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops
each)
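In line mode, the same options time a single statement. A minimal sketch (the statement is illustrative, and the timing output will vary by machine):
%timeit -n 1000 -r 7 sum(range(1000))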
Q-8. Explain Memory Profiler in Python.
memory_profiler is a Python module for monitoring the memory consumption
of a process, as well as for line-by-line analysis of the memory
consumption of Python programs.
Installation: pip install memory_profiler
Usage:
• First, load the profiler by writing %load_ext memory_profiler in
the notebook.
• Then prefix each statement whose memory usage you want to measure
with %memit.
%load_ext memory_profiler
from sklearn.feature_extraction.text import CountVectorizer
%memit countVector = CountVectorizer(stop_words='english', analyzer='word')
Output:
peak memory: 85.61 MiB, increment: 0.02 MiB
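For line-by-line analysis, the module also provides a @profile decorator. A minimal sketch, assuming a hypothetical script named example.py:
# example.py
from memory_profiler import profile

@profile
def build_list():
    # memory usage of each line is reported when run under the profiler
    data = [i ** 2 for i in range(100000)]
    total = sum(data)
    return total

if __name__ == "__main__":
    build_list()
Run it with python -m memory_profiler example.py to print a per-line memory report.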
Q-9. Write a program in Python to perform DIFFERENT
STATISTICAL OPERATIONS.
Python can manipulate statistical data and calculate the results of
various statistical operations using the statistics module, which is
useful in the domain of mathematics.
Important averages and measures of central location:
1. mean() :- This function returns the mean, or average, of the data
passed in its arguments. If the passed argument is empty,
StatisticsError is raised.
2. mode() :- This function returns the value with the maximum number
of occurrences. If the passed argument is empty, StatisticsError is
raised.
# importing statistics to handle statistical operations
import statistics

# initializing list
li = [1, 2, 3, 3, 2, 2, 2, 1]

# using mean() to calculate the average of the list elements
print("The average of list values is : ", end="")
print(statistics.mean(li))

# using mode() to print the most frequently occurring element
print("The maximum occurring element is : ", end="")
print(statistics.mode(li))
Output:
The average of list values is : 2.0
The maximum occurring element is : 2
3. median() :- This function is used to calculate the median, i.e., the
middle element of the data. If the passed argument is empty,
StatisticsError is raised.
4. median_low() :- This function returns the median of the data in case
of an odd number of elements; in case of an even number of elements,
it returns the lower of the two middle elements. If the passed
argument is empty, StatisticsError is raised.
5. median_high() :- This function returns the median of the data in case
of an odd number of elements; in case of an even number of elements,
it returns the higher of the two middle elements. If the passed
argument is empty, StatisticsError is raised.
# importing statistics to handle statistical operations
import statistics

# initializing list
li = [1, 2, 2, 3, 3, 3]

# using median() to print the median of the list elements
print("The median of list element is : ", end="")
print(statistics.median(li))

# using median_low() to print the low median of the list elements
print("The lower median of list element is : ", end="")
print(statistics.median_low(li))

# using median_high() to print the high median of the list elements
print("The higher median of list element is : ", end="")
print(statistics.median_high(li))
Output:
The median of list element is : 2.5
The lower median of list element is : 2
The higher median of list element is : 3
6. median_grouped() :- This function is used to compute the grouped
median, i.e., the 50th percentile of the data. If the passed argument
is empty, StatisticsError is raised.
# importing statistics to handle statistical operations
import statistics

# initializing list
li = [1, 2, 2, 3, 3, 3]

# using median_grouped() to calculate the 50th percentile
print("The 50th percentile of data is : ", end="")
print(statistics.median_grouped(li))
Output:
The 50th percentile of data is : 2.5