Python for Data Science (3150713)
Chapter 5: Data Wrangling
Q-1. Write a short note on Data Wrangling.
Data Wrangling is the process of cleaning, transforming, and
organizing raw data into a structured and usable format for analysis.
This process is crucial because real-world data often comes in
inconsistent, incomplete, or unstructured forms, such as missing
values, duplicate entries, or incorrect formats.
Key steps in data wrangling include:
1. Data Cleaning: Identifying and fixing or removing inaccurate,
missing, or inconsistent data.
2. Data Transformation: Converting data into the desired format,
including normalization, aggregation, or restructuring.
3. Data Integration: Combining data from different sources into a
single, cohesive dataset.
4. Data Enrichment: Adding relevant external data to enhance the
original dataset.
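A minimal pandas sketch of these steps, assuming two small hypothetical DataFrames (all column names and values here are illustrative):
import pandas as pd

# Hypothetical raw data with a missing value and a duplicate row
sales = pd.DataFrame({
    "customer_id": [1, 2, 2, 3],
    "amount": [250.0, None, None, 400.0]
})
regions = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "region": ["East", "West", "North"]
})

# 1. Cleaning: drop the duplicate row, fill the missing amount with the median
sales = sales.drop_duplicates()
sales["amount"] = sales["amount"].fillna(sales["amount"].median())

# 2. Transformation: aggregate the total amount per customer
totals = sales.groupby("customer_id", as_index=False)["amount"].sum()

# 3. Integration: combine with region data from another source
merged = totals.merge(regions, on="customer_id")
print(merged)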
Q-2. Write a short note on Exploratory Data Analysis (EDA).
Exploratory Data Analysis (EDA) is a crucial initial step in data science
projects. It refers to the method of studying and exploring datasets to
understand their key characteristics, uncover patterns, locate outliers,
and identify relationships between variables. EDA is normally carried out
as a preliminary step before undertaking more formal statistical analyses
or modeling.
Key aspects of EDA include:
• Distribution of Data: Examining the distribution of data points
to understand their range, central tendencies (mean, median), and
dispersion (variance, standard deviation).
• Graphical Representations: Utilizing charts such as histograms,
box plots, scatter plots, and bar charts to visualize relationships
within the data and distributions of variables.
• Outlier Detection: Identifying unusual values that deviate from
other data points. Outliers can influence statistical analyses and
might indicate data entry errors or unique cases.
• Correlation Analysis: Checking the relationships between
variables to understand how they might affect each other. This
includes computing correlation coefficients and creating
correlation matrices.
• Handling Missing Values: Detecting and deciding how to
address missing data points, whether by imputation or removal,
depending on their impact and the amount of missing data.
• Summary Statistics: Calculating key statistics that provide
insight into data trends and nuances.
• Testing Assumptions: Many statistical tests and models assume
the data meet certain conditions (like normality or
homoscedasticity). EDA helps verify these assumptions.
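A minimal EDA sketch covering several of these aspects, assuming a small hypothetical DataFrame df with numeric columns:
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical dataset
df = pd.DataFrame({
    "height": [5.6, 6.1, 5.8, 5.9, 6.3, 5.5],
    "weight": [60, 75, 68, 70, 80, 58]
})

print(df.describe())      # summary statistics: mean, std, quartiles
print(df.isnull().sum())  # missing values per column
print(df.corr())          # correlation matrix

df.hist()                                 # distribution of each variable
df.plot.scatter(x="height", y="weight")   # relationship between variables
plt.show()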
Q-3. Differentiate Numerical Data and Categorical Data
with suitable example. Also explain how to handle such data
types.
Numerical Data and Categorical Data are two fundamental types of
data, each requiring different approaches for analysis and handling.
1. Numerical Data:
• Definition: Data that consists of numbers and can be quantified.
• Types:
o Discrete: Data with countable values (e.g., integers), such as
the number of cars in a parking lot (e.g., 10, 15, 20).
o Continuous: Data that can take any value within a range (e.g.,
floating-point numbers), such as the height of students (e.g.,
5.6 feet, 6.1 feet).
• Handling:
o Scaling: Normalize or standardize values (e.g., min-max
scaling or z-score standardization) so that features are on
comparable ranges.
o Missing Values: Impute with the mean or median, or remove
rows, depending on the amount of missing data.
o Outliers: Detect extreme values (e.g., with box plots or
z-scores) and decide whether to cap, transform, or remove
them.
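A minimal sketch of these handling steps with scikit-learn (the feature values are hypothetical):
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Hypothetical numerical feature with a missing value
heights = np.array([[5.6], [6.1], [np.nan], [5.9]])

# Impute the missing value with the mean
heights = SimpleImputer(strategy="mean").fit_transform(heights)

# Standardize to zero mean and unit variance
heights_scaled = StandardScaler().fit_transform(heights)
print(heights_scaled)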
2. Categorical Data:
• Definition: Data that represents categories or groups and is
non-numeric.
• Types:
o Nominal: Categories with no intrinsic order (e.g., colors,
gender), such as car brands (e.g., Toyota, Honda, Ford).
o Ordinal: Categories with a meaningful order but no fixed
interval between them, such as satisfaction levels (e.g.,
Poor, Fair, Good, Excellent).
• Handling:
o Encoding:
▪ For nominal data, use One-Hot Encoding to convert
categories into binary columns.
▪ For ordinal data, use Label Encoding or manual
mapping to assign meaningful numerical values to
ordered categories.
o Missing Values: Use modes (most frequent values) for
imputation or treat them as a separate category.
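A minimal sketch of both encodings using pandas (the column names and categories are hypothetical):
import pandas as pd

df = pd.DataFrame({
    "brand": ["Toyota", "Honda", "Ford"],          # nominal
    "satisfaction": ["Poor", "Good", "Excellent"]  # ordinal
})

# Nominal: one-hot encoding creates one binary column per category
df = pd.get_dummies(df, columns=["brand"])

# Ordinal: manual mapping preserves the order
order = {"Poor": 0, "Fair": 1, "Good": 2, "Excellent": 3}
df["satisfaction"] = df["satisfaction"].map(order)
print(df)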
Q-4. Write a short note on Classes in the Scikit-learn library.
Classes in Scikit-learn
• Understanding how classes work is an important prerequisite for
being able to use the Scikit-learn package appropriately.
• Scikit-learn is the package for machine learning and data science
experimentation favored by most data scientists.
• It contains a wide range of well-established learning algorithms,
error functions, and testing procedures.
• Install: conda install scikit-learn
• There are four class types covering all the basic machine-learning
functionalities:
o Classifying
o Regressing
o Grouping by clusters
o Transforming data
• Even though each base class has specific methods and attributes,
the core functionalities for data processing and machine learning
are guaranteed by one or more series of methods and attributes
called interfaces.
• The interfaces provide a uniform Application Programming
Interface (API) to enforce similarity of methods and attributes
between all the different algorithms present in the package. There
are four Scikit-learn object-based interfaces:
o estimator: For fitting parameters, learning them from data
according to the algorithm.
o predictor: For generating predictions from the fitted
parameters.
o transformer: For transforming data, implementing the
fitted parameters.
o model: For reporting goodness of fit or other score
measures.
• The package groups the algorithms built on base classes and one
or more object interfaces into modules, each module displaying a
specialization in a particular type of machine-learning solution.
For example, the "linear_model" module is for linear modeling,
and metrics is for score and loss measures.
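A minimal sketch showing the four interfaces on two standard Scikit-learn classes (the toy data is hypothetical):
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

X = [[1], [2], [3], [4]]
y = [2, 4, 6, 8]

model = LinearRegression()
model.fit(X, y)              # estimator: learn parameters from data
print(model.predict([[5]]))  # predictor: generate predictions
print(model.score(X, y))     # model: report goodness of fit (R^2)

scaler = StandardScaler()
print(scaler.fit_transform(X))  # transformer: transform data using fitted parameters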
Q-5. Differentiate Supervised and Unsupervised Learning.
Feature | Supervised Learning | Unsupervised Learning
------- | ------------------- | ---------------------
Input data | Uses known, labeled data as input | Uses unlabeled data as input
Computational complexity | Less computational complexity | More computational complexity
Real-time | Uses offline analysis of data | Uses real-time analysis of data
Number of classes | The number of classes is known | The number of classes is not known
Accuracy of results | Accurate and reliable results | Moderately accurate and reliable results
Output data | The desired output is given | The desired output is not given
Model | It is not possible to learn larger and more complex models than in unsupervised learning | It is possible to learn larger and more complex models than in supervised learning
Training data | Training data is used to infer the model | Labeled training data is not used
Another name | Also called classification | Also called clustering
Categorization | Can be categorized into classification and regression problems | Can be categorized into clustering and association problems
Test of model | We can test our model | We cannot test our model
Example | Optical character recognition | Finding a face in an image
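To make the contrast concrete, a minimal sketch on the same toy data (the points are hypothetical; labels are used only in the supervised case):
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

X = [[1, 1], [1, 2], [8, 8], [9, 8]]
y = [0, 0, 1, 1]  # labels, available only in the supervised setting

# Supervised: learn from labeled data, then predict the known classes
clf = LogisticRegression().fit(X, y)
print(clf.predict([[2, 1], [8, 9]]))

# Unsupervised: group unlabeled data; the meaning of the groups is not given
km = KMeans(n_clusters=2, n_init=10).fit(X)
print(km.labels_)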
Q-6. Explain the Hashing Trick in Python with an example.
Most machine learning algorithms require numerical inputs. If your data
consists of text or categories, it must first be converted into numeric
values. One way to do this is the hashing trick, which applies a hash
function to each feature (e.g., a word) to map it directly to a column
index, without storing an explicit vocabulary.
Example
• Data Sample:
Id | Salary | Gender | Gender_Num
-- | ------ | ------ | ----------
1 | 10000 | Male | 0
2 | 15000 | Female | 1
3 | 12000 | Female | 1
4 | 13000 | Male | 0
5 | 14000 | Male | 0
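The Gender_Num column above is a simple label encoding; a minimal pandas sketch that reproduces it (the DataFrame mirrors the sample):
import pandas as pd

df = pd.DataFrame({
    "Id": [1, 2, 3, 4, 5],
    "Salary": [10000, 15000, 12000, 13000, 14000],
    "Gender": ["Male", "Female", "Female", "Male", "Male"]
})

# Map each category to a number, as in the Gender_Num column
df["Gender_Num"] = df["Gender"].map({"Male": 0, "Female": 1})
print(df)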
When dealing with text, Scikit-learn offers several ways to build such
numeric representations. The example below uses CountVectorizer, which
learns an explicit vocabulary and counts word occurrences; the hashing
trick proper is implemented by HashingVectorizer, shown after the output.
Example Text
• "Prime is the best engineering college in Navsari."
• "Navsari is famous for engineering."
• "College is located in Mangrol."
Numerical Representation
The learned vocabulary, in alphabetical order, is: best, college,
engineering, famous, for, in, is, located, mangrol, navsari, prime, the.
Each sentence becomes a vector of word counts over these 12 columns:
Sentence | best | college | engineering | famous | for | in | is | located | mangrol | navsari | prime | the
1 | 1 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 1 | 1
2 | 0 | 0 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0
3 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 0
Code Example:
from sklearn.feature_extraction.text import CountVectorizer

# Sample text
text = [
    "Prime is the best engineering college in Navsari",
    "Navsari is famous for engineering",
    "College is located in Mangrol"
]

# Create a CountVectorizer and learn the vocabulary
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(text)

# Output the document-term count matrix
print(X.toarray())
Output:
[[1 1 1 0 0 1 1 0 0 1 1 1]
[0 0 1 1 1 0 1 0 0 1 0 0]
[0 1 0 0 0 1 1 1 1 0 0 0]]
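For contrast, a minimal sketch of the hashing trick itself using Scikit-learn's HashingVectorizer (n_features=16 is kept small here purely for readability; real applications use a much larger value):
from sklearn.feature_extraction.text import HashingVectorizer

text = [
    "Prime is the best engineering college in Navsari",
    "Navsari is famous for engineering",
    "College is located in Mangrol"
]

# Each word is hashed to one of n_features columns; no vocabulary is stored
hasher = HashingVectorizer(n_features=16, norm=None, alternate_sign=False)
X = hasher.transform(text)
print(X.toarray())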
Q-7. Explain timeit magic command in Jupyter Notebook
with example.
We can find the time taken to execute a statement or a cell in a Jupyter
Notebook with the help of the timeit magic command.
This command can be used both as a line magic and a cell magic:
• In line mode you can time a single-line statement (multiple
statements can be chained using semicolons).
Syntax: %timeit [-n<N> -r<R> [-t|-c] -q -p<P> -o] statement
• In cell mode, the statement on the first line is used as setup code
(executed but not timed) and the body of the cell is timed. The
cell body has access to any variables created in the setup code.
Syntax: %%timeit [-n<N> -r<R> [-t|-c] -q -p<P> -o] setup_code
Here, the -n flag gives the number of loops per run and the -r flag
gives the number of runs (repeats). A line-mode example follows the
cell example below.
• Example:
%%timeit -n 1000 -r 7
listPrime = []
for i in range(2, 1000):
    flag = 0
    for j in range(2, i):
        if i % j == 0:
            flag = 1
            break
    if flag == 0:
        listPrime.append(i)
Output:
14.7 ms ± 953 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops
each)
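In line mode, the same options time a single statement. A minimal sketch (the statement is illustrative, and the timing output will vary by machine):
%timeit -n 1000 -r 7 sum(range(1000))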
Q-8. Explain Memory Profiler in Python.
memory_profiler is a Python module for monitoring the memory consumption
of a process, as well as for line-by-line analysis of the memory
consumption of Python programs.
Installation: pip install memory_profiler
Usage:
• First, load the profiler by writing %load_ext memory_profiler in
the notebook.
• Then prefix each statement whose memory usage you want to measure
with %memit.
%load_ext memory_profiler
from sklearn.feature_extraction.text import CountVectorizer
%memit countVector = CountVectorizer(stop_words='english', analyzer='word')
Output:
peak memory: 85.61 MiB, increment: 0.02 MiB
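For line-by-line analysis, the module also provides a @profile decorator. A minimal sketch, assuming a hypothetical script named example.py:
# example.py
from memory_profiler import profile

@profile
def build_list():
    # memory usage of each line is reported when run under the profiler
    data = [i ** 2 for i in range(100000)]
    total = sum(data)
    return total

if __name__ == "__main__":
    build_list()
Run it with python -m memory_profiler example.py to print a per-line memory report.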
Q-9. Write a program in Python to perform DIFFERENT
STATISTICAL OPERATIONS.
Python can manipulate statistical data and calculate the results of
various statistical operations using the statistics module, which is
useful in the domain of mathematics.
Important averages and measures of central location:
1. mean() :- This function returns the mean, or average, of the data
passed in its arguments. If the passed argument is empty,
StatisticsError is raised.
2. mode() :- This function returns the value with the maximum number
of occurrences. If the passed argument is empty, StatisticsError is
raised.
# importing statistics to handle statistical operations
import statistics

# initializing list
li = [1, 2, 3, 3, 2, 2, 2, 1]

# using mean() to calculate the average of the list elements
print("The average of list values is : ", end="")
print(statistics.mean(li))

# using mode() to print the most frequently occurring element
print("The maximum occurring element is : ", end="")
print(statistics.mode(li))
Output:
The average of list values is : 2.0
The maximum occurring element is : 2
3. median() :- This function is used to calculate the median, i.e., the
middle element of the data. If the passed argument is empty,
StatisticsError is raised.
4. median_low() :- This function returns the median of the data in case
of an odd number of elements; in case of an even number of elements,
it returns the lower of the two middle elements. If the passed
argument is empty, StatisticsError is raised.
5. median_high() :- This function returns the median of the data in case
of an odd number of elements; in case of an even number of elements,
it returns the higher of the two middle elements. If the passed
argument is empty, StatisticsError is raised.
# importing statistics to handle statistical operations
import statistics

# initializing list
li = [1, 2, 2, 3, 3, 3]

# using median() to print the median of the list elements
print("The median of list element is : ", end="")
print(statistics.median(li))

# using median_low() to print the low median of the list elements
print("The lower median of list element is : ", end="")
print(statistics.median_low(li))

# using median_high() to print the high median of the list elements
print("The higher median of list element is : ", end="")
print(statistics.median_high(li))
Output:
The median of list element is : 2.5
The lower median of list element is : 2
The higher median of list element is : 3
6. median_grouped() :- This function is used to compute the grouped
median, i.e., the 50th percentile of the data. If the passed argument
is empty, StatisticsError is raised.
# importing statistics to handle statistical operations
import statistics

# initializing list
li = [1, 2, 2, 3, 3, 3]

# using median_grouped() to calculate the 50th percentile
print("The 50th percentile of data is : ", end="")
print(statistics.median_grouped(li))
Output:
The 50th percentile of data is : 2.5