Python for data science (3150713)
Chapter 3: Getting Your Hands Dirty With Data
Q-1. List different IDEs for Python. Explain the advantages and
disadvantages of each.
1. PyCharm
• Description: A dedicated Python IDE from JetBrains, available in
Community (free) and Professional (paid) editions.
Advantages:
• Smart Code Assistance: Advanced code completion, error detection, and
code refactoring.
• Debugger: Robust debugging capabilities with breakpoints, variable
watching, and step-through execution.
• Integrated Tools: Support for version control systems (e.g., Git), Docker,
and databases.
• Professional Edition: Supports web development frameworks like Django
and Flask.
Disadvantages:
• Heavy Resource Usage: High memory and CPU usage can slow down
performance, especially for large projects.
• Steep Learning Curve: Can be overwhelming for beginners.
• Professional Edition is Paid: The free Community edition lacks some
advanced features.
2. Visual Studio Code (VS Code)
• Description: A lightweight, open-source code editor from Microsoft, with
Python support via extensions.
Advantages:
• Extensibility: Highly customizable through extensions, including Python,
Git, Docker, and more.
• Integrated Debugging: Offers built-in debugging tools for Python.
• Performance: Lightweight and fast compared to full-fledged IDEs.
• Cross-platform: Runs on Windows, macOS, and Linux.
Disadvantages:
• Requires Configuration: Needs extensions to achieve full Python
functionality, which may take some time to set up.
• Can Become Slow: Installing too many extensions can affect performance.
• Not as Feature-Rich: Lacks some of the advanced features found in
dedicated IDEs like PyCharm out of the box.
3. Spyder
• Description: An open-source IDE tailored for data science and scientific
computing, often used in conjunction with packages like NumPy and
pandas.
Advantages:
• Data Science Oriented: Built-in integration with data analysis and
visualization libraries like Matplotlib and pandas.
• Variable Explorer: Allows inspection and manipulation of data variables in
memory, making it great for data analysis.
• Lightweight: Easy to set up and use for quick scripts or data analysis tasks.
Disadvantages:
• Limited Features: Lacks some advanced features found in full-fledged
IDEs (e.g., web development tools).
• Not Suited for Larger Applications: Geared towards scientific computing,
not complex software development.
• Basic UI: The interface feels outdated compared to modern IDEs.
4. Jupyter Notebook / JupyterLab
• Description: A web-based interactive development environment widely
used in data science and machine learning.
Advantages:
• Interactive: Supports live code execution, visualization, and markdown,
making it ideal for iterative development.
• Visualization: Seamless integration with libraries like Matplotlib,
TensorFlow, and pandas for plotting and data analysis.
• Notebooks: Shareable notebooks are great for collaboration and
presentations.
Disadvantages:
• Not for Large Projects: Managing complex projects can be challenging.
• Basic Debugging: Lacks robust debugging and error-checking features.
• Can be Slow: Performance issues can arise with large datasets or long
notebooks.
5. Thonny
• Description: A simple Python IDE designed for beginners, offering an
easy-to-use interface with debugging tools.
Advantages:
• Beginner-Friendly: Designed for learners with a simple interface and a
focus on ease of use.
• Integrated Debugger: Step-through debugging with visual representations
of variables and expressions.
• Lightweight: Small and fast, ideal for learning Python without complex
configurations.
Disadvantages:
• Limited Features: Lacks advanced tools and features needed for large or
complex projects.
• Not Customizable: Fewer customization and extension options compared
to more advanced IDEs.
• Not Suitable for Advanced Users: Lacks support for advanced libraries and
web frameworks.
6. IDLE
• Description: The default IDE that comes bundled with Python, offering a
basic environment for simple scripts.
Advantages:
• Pre-installed: Comes with Python, so no need to install separately.
• Simple Interface: Very easy to use, especially for beginners.
• Lightweight: Minimalist design with fast startup and execution.
Disadvantages:
• Lack of Features: Very basic in terms of functionality, lacks features like
code analysis, project management, and integration with version control.
• Not for Large Projects: Unsuitable for complex or large-scale applications.
• Outdated UI: The user interface is quite dated compared to modern IDEs.
7. Wing IDE
• Description: A professional-grade Python IDE focused on providing
advanced debugging and development tools.
Advantages:
• Powerful Debugger: Provides in-depth debugging tools with breakpoints,
call stacks, and variable inspection.
• Professional Features: Designed for professional development, with
support for unit testing, code refactoring, and Django integration.
• Customizable: Extensive options to customize the development
environment.
Disadvantages:
• Paid Software: Free version is limited in features, and the full version is
paid.
• Resource-Intensive: Can be slow, especially when working with large
projects.
• Steeper Learning Curve: More complex to use for beginners.
Q-2. Write a short note on Jupyter notebooks.
Jupyter Notebooks are an open-source, web-based platform that allows users to
create and share documents that combine live code, visualizations, and narrative
text.
Primarily used in data science, machine learning, and scientific research, they
support multiple programming languages like Python, R, and Julia.
Jupyter Notebooks offer an interactive environment where users can write and
execute code in cells, display results, and document their workflows using
Markdown text.
This makes them ideal for data exploration, analysis, visualization, and sharing
reproducible research. Notebooks can be easily shared or exported in various
formats, facilitating collaboration and communication.
Key Features:
1. Interactive coding: Write and execute code in cells.
2. Multi-language support: Supports over 40 programming languages, including
Python, R, Julia, and MATLAB.
3. Rich text editing: Include LaTeX equations, images, and HTML content.
4. Visualization: Integrate plots, charts, and graphs from popular libraries like
Matplotlib and Seaborn.
5. Collaboration: Share notebooks via URL or export to various formats (e.g.,
PDF, HTML, Markdown).
6. Extensive libraries: Access thousands of libraries and tools, including NumPy,
Pandas, and Scikit-learn.
Benefits:
1. Data exploration and visualization
2. Prototyping and testing
3. Education and research
4. Collaboration and communication
5. Reproducibility and transparency
Applications:
1. Data science and analytics
2. Machine learning and AI
3. Scientific computing and research
4. Education and teaching
5. Business intelligence and reporting
Q-3. Explain Basic IO operations in Python.
Basic I/O (Input/Output) operations in Python involve reading from input sources
like the keyboard and writing output to destinations like the console or a file.
Here’s a breakdown of these operations:
1. Input Operations:
Python provides the input() function to get input from users.
Syntax:
user_input = input("Prompt message: ")
• Explanation: The input() function displays the given prompt and waits for
the user to type something. It reads the user input as a string.
Example:
name = input("Enter your name: ")
print("Hello, " + name)
If the user types "Alice," the output will be:
Hello, Alice
For numeric input, you can convert the string to an integer or float:
age = int(input("Enter your age: "))
height = float(input("Enter your height: "))
2. Output Operations:
The print() function is used to output data to the console.
Syntax:
print(*objects, sep=' ', end='\n')
• objects: One or more expressions you want to print.
• sep: Optional. Specifies a separator between the objects (default is a space).
• end: Optional. Specifies what to print at the end (default is a newline \n).
Example:
print("Hello, world!")
Output:
Hello, world!
You can print multiple objects and customize the separator and end character:
print("Hello", "World", sep='-', end='!')
Output:
Hello-World!
3. File I/O Operations:
Python also allows reading from and writing to files using built-in functions.
Opening a File:
You can open a file using the open() function.
file = open('filename', mode)
• filename: Name of the file to be opened.
• mode: Specifies the mode for opening the file (read, write, append, etc.).
Common modes:
• 'r': Read (default mode, file must exist).
• 'w': Write (creates a new file or overwrites if it exists).
• 'a': Append (adds content to the end of the file).
• 'r+': Read and write.
Writing to a File:
To write data to a file, use the write() or writelines() function.
file = open('example.txt', 'w')
file.write("Hello, world!")
file.close()
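The writelines() method mentioned above writes a sequence of strings in one call; a minimal sketch (note that writelines() does not add newline characters itself):
lines = ["First line\n", "Second line\n", "Third line\n"]
file = open('example.txt', 'w')
file.writelines(lines)
file.close()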
Reading from a File:
You can read the contents of a file using the read(), readline(), or readlines()
functions.
file = open('example.txt', 'r')
content = file.read()
print(content)
file.close()
Best Practice: Use with Statement:
The with statement ensures that the file is properly closed after its operations are
done.
with open('example.txt', 'r') as file:
    content = file.read()
    print(content)
Q-4. Write a short note on Data Conditioning.
Data Conditioning: Preparing Data for Analysis
Data conditioning is the process of preparing raw data for analysis by cleaning,
transforming, and formatting it to improve its quality, consistency, and reliability.
The goal is to ensure data accuracy, completeness, and relevance for statistical
modeling, machine learning, or data visualization.
Key Steps in Data Conditioning:
1. Data Cleaning: Identify and correct errors, handle missing values, and remove
duplicates or irrelevant data.
2. Data Transformation: Convert data types (e.g., text to numerical),
scale/normalize values, and perform feature engineering.
3. Data Integration: Combine data from multiple sources, handle inconsistencies,
and perform data merging.
4. Data Reduction: Select relevant features, reduce dimensionality, and remove
noisy or redundant data.
5. Data Formatting: Reorganize data structures, handle date/time formats, and
ensure data consistency.
Techniques Used:
1. Handling missing values (imputation, interpolation)
2. Data normalization (min-max scaling, standardization)
3. Feature scaling (log transformation, standardization)
4. Encoding categorical variables (one-hot encoding, label encoding)
5. Data aggregation (grouping, summarization)
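A minimal, self-contained Pandas sketch of three of these techniques (mean imputation, min-max scaling, and one-hot encoding) on made-up data; the column names are purely illustrative:
import pandas as pd

# Made-up data with a missing value and a categorical column
df = pd.DataFrame({'age': [25, None, 42],
                   'income': [30000, 52000, 71000],
                   'city': ['Pune', 'Delhi', 'Pune']})

# Imputation: fill the missing age with the column mean
df['age'] = df['age'].fillna(df['age'].mean())

# Min-max scaling: rescale income to the [0, 1] range
df['income'] = (df['income'] - df['income'].min()) / (df['income'].max() - df['income'].min())

# One-hot encoding: expand the categorical column into indicator columns
df = pd.get_dummies(df, columns=['city'])
print(df)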
Benefits:
1. Improved data quality and accuracy
2. Enhanced model performance and reliability
3. Better data visualization and insights
4. Reduced errors and biases
5. Increased efficiency in data analysis
Tools and Technologies:
1. Pandas and NumPy (Python libraries)
2. R programming language
3. Data preprocessing tools (e.g., OpenRefine, Trifacta)
4. Machine learning frameworks (e.g., scikit-learn, TensorFlow)
Q-5. Write a short note on Data Shaping.
Data Shaping: Reshaping Data for Analysis
Data shaping is the process of transforming and restructuring data from its raw
format into a suitable format for analysis, visualization, or modeling. It involves
rearranging data to facilitate insights, improve data quality, and enhance
compatibility with various tools and techniques.
Types of Data Shaping:
1. Pivoting: Rotating data from long to wide format or vice versa.
2. Merging: Combining data from multiple sources.
3. Aggregating: Grouping data and calculating summaries.
4. Reshaping: Changing data structure (e.g., from flat to hierarchical).
5. Transposing: Swapping rows and columns.
Data Shaping Techniques:
1. Data stacking: Combining multiple datasets.
2. Data unstacking: Separating stacked data.
3. Data melting: Converting wide data to long format.
4. Data casting: Converting data types.
Benefits:
1. Improved data readability
2. Enhanced data analysis capabilities
3. Better data visualization
4. Increased compatibility with tools and models
5. Simplified data management
By applying data shaping techniques, data analysts and scientists can transform
raw data into a meaningful, analysis-ready format, unlocking insights and
informing business decisions.
Some examples of data shaping functions in Python using the Pandas library:
import pandas as pd
# Create a sample dataframe
data = {'Name': ['John', 'Mary', 'David'],
'Age': [25, 31, 42],
'Country': ['USA', 'Canada', 'UK']}
df = pd.DataFrame(data)
# Pivot data
pivoted_df = pd.pivot_table(df, values='Age', index='Name',
columns='Country')
# Merge data
df1 = pd.DataFrame({'Name': ['John', 'Mary'], 'Gender': ['Male', 'Female']})
merged_df = pd.merge(df, df1, on='Name')
# Reshape data
reshaped_df = pd.melt(df, id_vars='Name', value_vars='Age')
print(pivoted_df)
print(merged_df)
print(reshaped_df)
Output:-
Country  Canada    UK   USA
Name
David       NaN  42.0   NaN
John        NaN   NaN  25.0
Mary       31.0   NaN   NaN

   Name  Age Country  Gender
0  John   25     USA    Male
1  Mary   31  Canada  Female

    Name variable  value
0   John      Age     25
1   Mary      Age     31
2  David      Age     42
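The stacking, unstacking, and transposing operations listed above can be sketched the same way; a small self-contained example with illustrative values:
import pandas as pd

df = pd.DataFrame({'Name': ['John', 'Mary'], 'Age': [25, 31]})

# Stacking: pivot the columns into the inner level of a row MultiIndex
stacked = df.stack()
print(stacked)

# Unstacking: the inverse operation restores the original shape
print(stacked.unstack())

# Transposing: swap rows and columns
print(df.T)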
Q-6. Differentiate Numpy and Pandas
• Data type: Pandas is preferred when we have to work on tabular data, whereas NumPy is preferred when we have to work on numerical data.
• Core objects: The powerful tools of Pandas are the DataFrame and the Series, whereas the powerful tool of NumPy is the array.
• Memory: Pandas consumes more memory; NumPy is memory efficient.
• Performance: Pandas performs better when the number of rows is 500K or more; NumPy performs better when the number of rows is 50K or less.
• Indexing: Indexing of a Pandas Series is very slow compared to NumPy arrays; indexing of NumPy arrays is very fast.
• Dimensionality: Pandas provides a 2D table object called the DataFrame, whereas NumPy is capable of providing multi-dimensional arrays.
• History: Pandas was developed by Wes McKinney and released in 2008; NumPy was developed by Travis Oliphant and released in 2005.
• Adoption: Pandas is used in a lot of organizations like Kaidee, Trivago, Abeja Inc., and many more; NumPy is used in organizations like Walmart, Tokopedia, Instacart, and many more.
• Industry application: Pandas has a higher industry application; NumPy has a lower industry application.
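A tiny sketch of the contrast: the same data held as a NumPy array (positional access) and as a Pandas DataFrame (labeled access); the labels are illustrative:
import numpy as np
import pandas as pd

# NumPy: a plain numerical 2D array, indexed by position
arr = np.array([[25, 30000], [31, 52000]])
print(arr[0, 1])

# Pandas: the same data with labeled rows and columns
df = pd.DataFrame(arr, columns=['Age', 'Income'], index=['John', 'Mary'])
print(df.loc['John', 'Income'])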
Q-7. Explain Numpy Array with example.
NumPy stands for Numerical Python. It is a Python library used for working with
arrays. In Python, lists can serve as arrays, but they are slow to process. The
NumPy array is a powerful N-dimensional array object used in linear algebra,
Fourier transforms, and random number capabilities. It provides an array object
that is much faster than traditional Python lists.
Types of Array:
1. One Dimensional Array
2. Multi-Dimensional Array
One Dimensional Array:
A one-dimensional array is a type of linear array.
# importing numpy module
import numpy as np
# creating a list
my_list = [1, 2, 3, 4]
# creating a numpy array from the list
sample_array = np.array(my_list)
print("List in python :", my_list)
print("Numpy Array in python :", sample_array)
Output:-
List in python : [1, 2, 3, 4]
Numpy Array in python : [1 2 3 4]
Multi-Dimensional Array:
Data in multidimensional arrays are stored in tabular form.
# importing numpy module
import numpy as np
# creating list
list_1 = [1, 2, 3, 4]
list_2 = [5, 6, 7, 8]
list_3 = [9, 10, 11, 12]
# creating numpy array
sample_array = np.array([list_1,
list_2,
list_3])
print("Numpy multi dimensional array in python\n",
sample_array)
Output:-
Numpy multi dimensional array in python
[[ 1 2 3 4]
[ 5 6 7 8]
[ 9 10 11 12]]
Q-8. Differentiate the rand and randn functions in NumPy.
• Purpose: rand generates random numbers from a uniform distribution; randn generates random numbers from a standard normal distribution.
• Distribution type: rand draws from a uniform distribution over the interval [0.0, 1.0); randn draws from the standard normal distribution (mean = 0, standard deviation = 1).
• Range of values: rand returns values in [0.0, 1.0); randn can return any value in (-∞, ∞).
• Function signature: numpy.random.rand(d0, d1, ..., dn) and numpy.random.randn(d0, d1, ..., dn).
• Typical use cases: rand is used for random sampling and generating random floats; randn is used for generating data for statistical modeling or simulations.
• Output shape: for both functions, the output shape is specified by the input dimensions.
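A minimal sketch contrasting the two functions (the printed values differ on every run because both are random):
import numpy as np

# rand: samples uniformly from [0.0, 1.0)
uniform_sample = np.random.rand(2, 3)
print(uniform_sample)

# randn: samples from the standard normal distribution (mean 0, std 1)
normal_sample = np.random.randn(2, 3)
print(normal_sample)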
Q-9. List and Explain Numpy Aggregation functions with example.
1. numpy.sum(): Computes the sum of elements.
2. numpy.mean(): Computes the average of elements.
3. numpy.median(): Finds the median value.
4. numpy.std(): Calculates the standard deviation.
5. numpy.var(): Computes the variance.
6. numpy.min(): Finds the minimum value.
7. numpy.max(): Finds the maximum value.
1. numpy.sum()
Calculates the sum of array elements along a specified axis.
import numpy as np
array = np.array([[1, 2, 3], [4, 5, 6]])
# Sum of all elements
total_sum = np.sum(array)
print("Total Sum:", total_sum)
# Sum along rows (axis=0)
sum_rows = np.sum(array, axis=0)
print("Sum along rows:", sum_rows)
# Sum along columns (axis=1)
sum_columns = np.sum(array, axis=1)
print("Sum along columns:", sum_columns)
Output:-
Total Sum: 21
Sum along rows: [5 7 9]
Sum along columns: [ 6 15]
2. numpy.mean()
Calculates the arithmetic mean of array elements along a specified axis.
mean_value = np.mean(array)
print("Mean:", mean_value)
mean_rows = np.mean(array, axis=0)
print("Mean along rows:", mean_rows)
mean_columns = np.mean(array, axis=1)
print("Mean along columns:", mean_columns)
Output:-
Mean: 3.5
Mean along rows: [2.5 3.5 4.5]
Mean along columns: [2. 5.]
3. numpy.median()
Calculates the median of array elements along a specified axis.
median_value = np.median(array)
print("Median:", median_value)
median_rows = np.median(array, axis=0)
print("Median along rows:", median_rows)
median_columns = np.median(array, axis=1)
print("Median along columns:", median_columns)
Output:-
Median: 3.5
Median along rows: [2.5 3.5 4.5]
Median along columns: [2. 5.]
4. numpy.std()
Calculates the standard deviation of array elements along a specified axis.
std_dev = np.std(array)
print("Standard Deviation:", std_dev)
std_dev_rows = np.std(array, axis=0)
print("Standard Deviation along rows:", std_dev_rows)
std_dev_columns = np.std(array, axis=1)
print("Standard Deviation along columns:", std_dev_columns)
Output:-
Standard Deviation: 1.707825127659933
Standard Deviation along rows: [1.5 1.5 1.5]
Standard Deviation along columns: [0.81649658 0.81649658]
5. numpy.var()
Calculates the variance of array elements along a specified axis.
variance = np.var(array)
print("Variance:", variance)
variance_rows = np.var(array, axis=0)
print("Variance along rows:", variance_rows)
variance_columns = np.var(array, axis=1)
print("Variance along columns:", variance_columns)
Output:-
Variance: 2.9166666666666665
Variance along rows: [2.25 2.25 2.25]
Variance along columns: [0.66666667 0.66666667]
6. numpy.min()
Finds the minimum value in the array.
min_value = np.min(array)
print("Minimum Value:", min_value)
min_rows = np.min(array, axis=0)
print("Minimum along rows:", min_rows)
min_columns = np.min(array, axis=1)
print("Minimum along columns:", min_columns)
Output:-
Minimum Value: 1
Minimum along rows: [1 2 3]
Minimum along columns: [1 4]
7. numpy.max()
Finds the maximum value in the array.
max_value = np.max(array)
print("Maximum Value:", max_value)
max_rows = np.max(array, axis=0)
print("Maximum along rows:", max_rows)
max_columns = np.max(array, axis=1)
print("Maximum along columns:", max_columns)
Output:-
Maximum Value: 6
Maximum along rows: [4 5 6]
Maximum along columns: [3 6]
Q-10. Explain Series in Pandas with example.
A Pandas Series can be defined as a one-dimensional array that is capable of
storing various data types. We can easily convert a list, tuple, or dictionary
into a Series using the Series() constructor. The row labels of a Series are
called the index. A Series cannot contain multiple columns. It has the
following parameters:
1. data: It can be any list, dictionary, or scalar value.
2. index: The values of the index should be unique and hashable, and it must
be the same length as data. If we do not pass any index, the default
np.arange(n) will be used.
3. dtype: It refers to the data type of series.
4. copy: It is used for copying the data.
Creating a Pandas Series
In the real world, a Pandas Series will be created by loading the datasets from
existing storage, storage can be SQL Database, CSV file, and Excel file. Pandas
Series can be created from the lists, dictionary, and from a scalar value etc. Series
can be created in different ways, here are some ways by which we create a series:
Creating a Series from an array: In order to create a Series from an array, we
have to import the NumPy module and use its array() function.
import pandas as pd
import numpy as np

data = np.array([1, 7, 2])
myvar = pd.Series(data)
print(myvar)
Output:-
0    1
1    7
2    2
dtype: int64
Create a Series from list:
import pandas as pd
# a simple list
my_list = ['g', 'e', 'e', 'k', 's']
# create a Series from the list
ser = pd.Series(my_list)
print(ser)
Output:-
0    g
1    e
2    e
3    k
4    s
dtype: object
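The text above also mentions creating a Series from a dictionary and from a scalar value; a minimal sketch with illustrative keys and values:
import pandas as pd

# From a dictionary: the keys become the index
calories = {"day1": 420, "day2": 380, "day3": 390}
ser = pd.Series(calories)
print(ser)

# From a scalar: the value is repeated for each index label
ser2 = pd.Series(5, index=['a', 'b', 'c'])
print(ser2)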
Q-11. Explain DataFrame in Pandas with example.
A Pandas DataFrame is a two-dimensional, size-mutable, potentially
heterogeneous tabular data structure with labeled axes (rows and columns).
That is, data is aligned in a tabular fashion in rows and columns. A Pandas
DataFrame consists of three principal components: the data, the rows, and the
columns.
Create a DataFrame
We can create a DataFrame in the following ways:
o dict
o Lists
o NumPy ndarrays
o Series
Create a DataFrame from Dict of Series:
# importing the pandas library
import pandas as pd
info = {'one' : pd.Series([1, 2, 3, 4, 5, 6], index=['a', 'b', 'c', 'd', 'e', 'f']),
'two' : pd.Series([1, 2, 3, 4, 5, 6, 7, 8], index=['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])}
d1 = pd.DataFrame(info)
print (d1)
Output:-
one two
a 1.0 1
b 2.0 2
c 3.0 3
d 4.0 4
e 5.0 5
f 6.0 6
g NaN 7
h NaN 8
Create a DataFrame using List:
# importing the pandas library
import pandas as pd
# a list of strings
x = ['Python', 'Pandas', 'Java', 'PHP']
# Calling DataFrame constructor on list
df = pd.DataFrame(x)
print(df)
Output:-
        0
0  Python
1  Pandas
2    Java
3     PHP
Creating DataFrame from dict of ndarray/lists:
import pandas as pd
# initialise data of lists.
data = {'Name':['Tom', 'nick', 'krish', 'jack'],
'Age':[20, 21, 19, 18]}
# Create DataFrame
df = pd.DataFrame(data)
# Print the output.
print(df)
Output:-
    Name  Age
0    Tom   20
1   nick   21
2  krish   19
3   jack   18
Operations on Series:
# Performing arithmetic operations
series_a = pd.Series([1, 2, 3])
series_b = pd.Series([4, 5, 6])
sum_series = series_a + series_b
print("\nSum of Two Series:")
print(sum_series)
Output:-
Sum of Two Series:
0 5
1 7
2 9
dtype: int64
Q-12. Explain Multi-Index DataFrame in Pandas with example.
Multi-index in Python Pandas
Multi-index allows you to select more than one row and column in your index.
It is a multi-level or hierarchical index object for Pandas objects.
We can use various methods of multi-index such as MultiIndex.from_arrays(),
MultiIndex.from_tuples(), MultiIndex.from_product(), MultiIndex.from_frame,
etc., which helps us to create multiple indexes from arrays, tuples, DataFrame,
etc.
Example 1: Creating multi-index from arrays
After importing all the important Python libraries, we are creating an array of
names along with arrays of marks and age respectively.
Now with the help of MultiIndex.from_arrays, we are combining all three arrays
such that elements from all three arrays form multiple indexes together. After that,
we show the above result.
# importing pandas library
import pandas as pd
arrays = ['Sohom','Suresh','kumkum','subrata']
age= [10, 11, 12, 13]
marks=[90,92,23,64]
multi_index= pd.MultiIndex.from_arrays([arrays,age,marks], names=('names',
'age','marks'))
print(multi_index)
Output:-
MultiIndex([( 'Sohom', 10, 90),
            ('Suresh', 11, 92),
            ('kumkum', 12, 23),
            ('subrata', 13, 64)],
           names=['names', 'age', 'marks'])
Example 2: Creating multi-index from DataFrame using Pandas.
In this example, we are doing the same thing as the previous example: we create
a DataFrame using pd.DataFrame and then build a multi-index from that
DataFrame using MultiIndex.from_frame() along with the column names.
import pandas as pd
# Creating data
Information = {'name': ["Saikat", "Shrestha", "Sandi", "Abinash"],
'Jobs': ["Software Developer", "System Engineer",
"Footballer", "Singer"],
'Annual Salary(L.P.A)': [12.4, 5.6, 9.3, 10]}
# Creating a DataFrame from the whole data
df = pd.DataFrame(Information)
# Creating the multi-index from the DataFrame
multi_index = pd.MultiIndex.from_frame(df)
print(multi_index)
Output:-
MultiIndex([(  'Saikat', 'Software Developer', 12.4),
            ('Shrestha',    'System Engineer',  5.6),
            (   'Sandi',         'Footballer',  9.3),
            ( 'Abinash',             'Singer', 10.0)],
           names=['name', 'Jobs', 'Annual Salary(L.P.A)'])
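The other constructors listed earlier work similarly; a minimal sketch of MultiIndex.from_product(), which builds an index from every combination of the given iterables (the year/quarter values are illustrative):
import pandas as pd

years = [2023, 2024]
quarters = ['Q1', 'Q2']

# Every (year, quarter) combination becomes one index entry
multi_index = pd.MultiIndex.from_product([years, quarters], names=('year', 'quarter'))
print(multi_index)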
Q-13. Explain Cross Section in DataFrame with Example.
Sometimes we need to take a cross-section of a Pandas Series or DataFrame.
Here, cross-section means getting values at a specified index, at several
indexes, at several indexes and levels, or at a specified column and axis.
The function pandas.DataFrame.xs() helps in this situation.
pandas.DataFrame.xs() takes a key argument in order to select data at the
particular level in MultiIndex and returns cross-section from pandas data frame.
• Syntax: DataFrame.xs(key, axis=0, level=None, drop_level=True)
Parameters:
• key – Label contained in the index, or partially in a MultiIndex.
• axis – Axis to retrieve cross-section on.
• level – In case of a key partially contained in a MultiIndex, indicate which
levels are used.
• drop_level – If False, returns an object with the same levels as self.
Returns:
• Cross-section from the original DataFrame
# importing pandas library
import pandas as pd
# Creating a Dictionary
animal_dict = {'num_of_legs': [4, 0, 4, 2, 2, 2],
'num_of_wings': [0, 0, 0, 2, 2, 2],
'class': ['Reptiles', 'Reptiles', 'Reptiles',
'Birds', 'Birds', 'Birds'],
'animal': ['Turtle', 'Snake', 'Crocodile',
'Parrot', 'Owl', 'Hummingbird'],
'locomotion': ['swim_walk', 'swim_crawl', 'swim_walk',
'flies', 'flies', 'flies']}
# Converting to Data frame and setting index
df = pd.DataFrame(data=animal_dict)
df = df.set_index(['class', 'animal', 'locomotion'])
# Displaying the DataFrame
print(df)
Output:-
                                 num_of_legs  num_of_wings
class    animal      locomotion
Reptiles Turtle      swim_walk             4             0
         Snake       swim_crawl            0             0
         Crocodile   swim_walk             4             0
Birds    Parrot      flies                 2             2
         Owl         flies                 2             2
         Hummingbird flies                 2             2
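The code above only builds the MultiIndex; a short sketch of actually calling xs() on the df constructed above:
# Cross-section at key 'Birds' on the outer ('class') level:
# returns all bird rows, with the 'class' level dropped
print(df.xs('Birds'))

# Cross-section at an inner level, selected by level name
print(df.xs('flies', level='locomotion'))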
Q-14. Explain how to deal with missing data in Pandas.
Missing data can occur when no information is provided for one or more items
or for a whole unit. Missing data is a very big problem in real-life scenarios.
Missing data is also referred to as NA (Not Available) values in Pandas. Many
datasets simply arrive with missing data, either because it exists and was not
collected or because it never existed. For example, some users being surveyed
may choose not to share their income, and others may choose not to share their
address; in this way, many datasets end up with missing values.
Pandas treats None and NaN as essentially interchangeable for indicating
missing or null values. To facilitate this convention, there are several useful
functions for detecting, removing, and replacing null values in a Pandas
DataFrame:
• isnull()
• notnull()
• dropna()
• fillna()
• replace()
• interpolate()
1. pandas.isnull() Method
# importing pandas package
import pandas as pd
# making data frame from csv file
data = pd.read_csv("/content/employees.csv")
# creating bool series True for NaN values
bool_series = pd.isnull(data["Team"])
# filtering data
# displaying data only with team = NaN
data[bool_series]
Output: the rows of employees.csv where Team is NaN.
2. notnull() Method
# importing pandas package
import pandas as pd
# making data frame from csv file
data = pd.read_csv("/content/employees.csv")
# creating bool series False for NaN values
bool_series = pd.notnull(data["Gender"])
# displaying data only where Gender is not NaN
data[bool_series]
Output: the rows of employees.csv where Gender is not NaN.
3. dropna() Example
# importing pandas module
import pandas as pd
# making data frame from csv file
data = pd.read_csv("nba.csv")
# making new data frame with dropped NA values
new_data = data.dropna(axis=0, how='any')
# comparing sizes of data frames
print("Old data frame length:", len(data),
"\nNew data frame length:",
len(new_data),
"\nNumber of rows with at least 1 NA value: ",
(len(data)-len(new_data)))
Output:-
Old data frame length: 458
New data frame length: 364
Number of rows with at least 1 NA value: 94
4. replace() Example
import pandas as pd
values = {
    "Array_1": [49.50, 70],
    "Array_2": [65.1, 49.50]
}
data = pd.DataFrame(values)
print(data.replace(49.50, 60))
Output:-
   Array_1  Array_2
0     60.0     65.1
1     70.0     60.0
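The fillna() and interpolate() functions listed earlier were not shown above; a minimal sketch on a made-up Series:
import pandas as pd
import numpy as np

s = pd.Series([1.0, np.nan, 3.0, np.nan, 5.0])

# fillna(): replace NaN with a fixed value
print(s.fillna(0))

# interpolate(): estimate NaN values from their neighbours (linear by default)
print(s.interpolate())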
Q-15. Explain Groupby function in pandas with example.
The groupby function in Pandas is used to group data by one or more columns
and perform aggregation operations.
Basic Syntax:-
df.groupby(by=column_name)[column_to_aggregate].aggregate_function()
Example 1: Grouping by Single Column
Suppose we have a DataFrame df with columns Name, Age, and Sales.
import pandas as pd
# Create DataFrame
data = {'Name': ['John', 'Mary', 'John', 'Mary', 'David'],
'Age': [25, 31, 25, 31, 42],
'Sales': [100, 200, 300, 400, 500]}
df = pd.DataFrame(data)
# Group by Name and calculate total Sales
grouped_df = df.groupby('Name')['Sales'].sum()
print(grouped_df)
Output:-
Name
David    500
John     400
Mary     600
Name: Sales, dtype: int64
Example 2: Grouping by Multiple Columns
# Group by Name and Age, and calculate mean Sales
grouped_df = df.groupby(['Name', 'Age'])['Sales'].mean()
print(grouped_df)
Output:-
Name   Age
David  42     500.0
John   25     200.0
Mary   31     300.0
Name: Sales, dtype: float64
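groupby() can also apply several aggregations at once through agg(); a minimal sketch that rebuilds the same df used above:
import pandas as pd

df = pd.DataFrame({'Name': ['John', 'Mary', 'John', 'Mary', 'David'],
                   'Sales': [100, 200, 300, 400, 500]})

# Compute sum, mean, and count of Sales per Name in one pass
summary = df.groupby('Name')['Sales'].agg(['sum', 'mean', 'count'])
print(summary)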
Q-16. Explain join function in pandas with example.
The join() method joins the columns of another DataFrame or Series onto the calling DataFrame.
Syntax
dataframe.join(other, on, how, lsuffix, rsuffix, sort)
import pandas as pd
data1 = {
"name": ["Sally", "Mary", "John"],
"age": [50, 40, 30]
}
data2 = {
"qualified": [True, False, False]
}
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
newdf = df1.join(df2)
print(newdf)
Output:-
    name  age  qualified
0  Sally   50       True
1   Mary   40      False
2   John   30      False
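When both DataFrames share column names, the lsuffix and rsuffix parameters from the syntax above disambiguate them; a minimal sketch with illustrative data:
import pandas as pd

df1 = pd.DataFrame({'name': ['Sally', 'Mary'], 'score': [80, 90]})
df2 = pd.DataFrame({'name': ['Anna', 'Ben'], 'score': [70, 60]})

# Both frames have 'name' and 'score'; the suffixes keep them apart
joined = df1.join(df2, lsuffix='_left', rsuffix='_right')
print(joined)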
Q-17. Explain merge function in pandas with example.
The merge operation in Pandas merges two DataFrames based on their indexes
or a specified column.
The merge() in Pandas works similar to JOINs in SQL.
The syntax of the merge() method in Pandas is:
pd.merge(left, right, on=None, how='inner', left_on=None, right_on=None,
sort=False)
import pandas as pd
# create dataframes from the dictionaries
data1 = {
'EmployeeID' : ['E001', 'E002', 'E003', 'E004', 'E005'],
'Name' : ['John Doe', 'Jane Smith', 'Peter Brown', 'Tom Johnson', 'Rita Patel'],
'DeptID': ['D001', 'D003', 'D001', 'D002', 'D003'],
}
employees = pd.DataFrame(data1)
data2 = {
'DeptID': ['D001', 'D002', 'D003'],
'DeptName': ['Sales', 'HR', 'Admin']
}
departments = pd.DataFrame(data2)
# merge dataframes employees and departments
merged_df = pd.merge(employees, departments)
# display DataFrames
print("Employees:")
print(employees)
print()
print("Departments:")
print(departments)
print()
print("Merged DataFrame:")
print(merged_df)
Output:-
Employees:
EmployeeID Name DeptID
0 E001 John Doe D001
1 E002 Jane Smith D003
2 E003 Peter Brown D001
3 E004 Tom Johnson D002
4 E005 Rita Patel D003
Departments:
DeptID DeptName
0 D001 Sales
1 D002 HR
2 D003 Admin
Merged DataFrame:
EmployeeID Name DeptID DeptName
0 E001 John Doe D001 Sales
1 E003 Peter Brown D001 Sales
2 E002 Jane Smith D003 Admin
3 E005 Rita Patel D003 Admin
4 E004 Tom Johnson D002 HR
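The how, left_on, and right_on parameters from the syntax above can be sketched briefly; the column names here are illustrative:
import pandas as pd

left = pd.DataFrame({'emp': ['E001', 'E002', 'E006'],
                     'dept_id': ['D001', 'D003', 'D009']})
right = pd.DataFrame({'id': ['D001', 'D002', 'D003'],
                      'dept_name': ['Sales', 'HR', 'Admin']})

# Left join on differently named key columns:
# unmatched rows from the left frame are kept, with NaN filled in
out = pd.merge(left, right, how='left', left_on='dept_id', right_on='id')
print(out)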
Q-18. Differentiate the join and merge functions in Pandas.
• Primary use: join() combines DataFrames based on their index; merge() combines DataFrames based on columns or index.
• Default behavior: join() performs a left join by default; merge() performs an inner join by default.
• Join on columns: join() works primarily with the DataFrame index (you can join on columns but need to reset the index); merge() lets you specify columns or index to merge on, providing more flexibility.
• Join types: join() supports left, right, inner, and outer joins; merge() supports left, right, inner, outer, and cross joins.
• Flexibility: join() is less flexible for joining on columns and is ideal for joining by index; merge() is more flexible, since you can specify multiple columns or a combination of columns and index to merge on.
• Syntax: join() is simpler for joining on index; merge() is more versatile, as it allows joining on columns or indices.
• Key parameters: join() takes on=None, which defaults to the DataFrame index; merge() takes on, left_on, and right_on, which can be columns or indices.
• Handling overlapping column names: join() uses lsuffix and rsuffix to differentiate overlapping column names in the two DataFrames; merge() lets you specify suffixes to handle overlapping column names.
Q-19. Explain read_csv function in pandas with example.
CSV files are Comma-Separated Values files. To access data from a CSV file,
we use the read_csv() function from Pandas, which retrieves the data in the
form of a DataFrame.
Syntax of Pandas read_csv
Here is the Pandas read CSV syntax with its parameters.
Syntax:
pd.read_csv(filepath_or_buffer, sep=',', header='infer', index_col=None,
usecols=None, engine=None, skiprows=None, nrows=None)
Read CSV File using Pandas read_csv
import pandas as pd
df = pd.read_csv('data.csv')
print(df)
Output: the contents of data.csv displayed as a DataFrame.
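Since data.csv itself is not shown, here is a self-contained sketch that first creates a small CSV file and then reads it back, exercising the usecols and nrows parameters from the syntax above:
import pandas as pd

# Create a small CSV file for the demonstration
with open('data.csv', 'w') as f:
    f.write("Name,Age,City\n")
    f.write("Tom,20,Pune\n")
    f.write("Mary,31,Delhi\n")
    f.write("David,42,Mumbai\n")

# Read only two columns and the first two data rows
df = pd.read_csv('data.csv', usecols=['Name', 'Age'], nrows=2)
print(df)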
Q-20. Explain the read_excel function in Pandas with example.
Pandas read_excel()
1. The Pandas module read_excel() function reads the Excel sheet data into
a DataFrame object.
2. An Excel sheet is a two-dimensional table, and the DataFrame object
represents the two-dimensional tabular view.
3. We can get the header details from the Excel sheet.
4. We can also specify the columns to be read from the Excel sheet.
5. The DataFrame object has various useful methods to convert the Excel
data into CSV, Dictionary, or JSON representation.
Pandas read_excel() Example
import pandas
excel_data_df = pandas.read_excel('records.xlsx', sheet_name='Employees')
# print whole sheet data
print(excel_data_df)
Output:-
   EmpID    EmpName  EmpRole
0      1     Pankaj      CEO
1      2  David Lee   Editor
2      3   Lisa Ray   Author
Q-21. Explain Web Scraping with example using the
Beautiful Soup library.
Web Scraping is the process of extracting data from websites. It allows you to
collect and analyze large amounts of data from websites that don’t have an official
API. Python has several libraries for web scraping, and one of the most commonly
used ones is Beautiful Soup.
What is Beautiful Soup?
Beautiful Soup is a Python library that makes it easy to scrape information from
web pages. It allows you to parse HTML and XML documents and extract useful
data from them. It works with a parser, such as the built-in Python parser or third-
party parsers like lxml.
Key Steps in Web Scraping using Beautiful Soup:
1. Send a request to the webpage to retrieve its content.
2. Parse the HTML content of the webpage.
3. Navigate and search the parsed HTML tree to extract the required
information.
4. Extract and store the scraped data.
Example: Web Scraping with Beautiful Soup
Let’s scrape a sample website to extract some data.
Libraries Needed:
• requests: To make a request to the webpage.
• Beautiful Soup: To parse the HTML content.
# Importing necessary libraries
import requests
from bs4 import BeautifulSoup
# Step 1: Send a request to the website
url = "https://example.com" # Replace with the website you want to scrape
response = requests.get(url)
# Step 2: Parse the HTML content with Beautiful Soup
soup = BeautifulSoup(response.text, 'html.parser')
# Step 3: Extract all paragraph tags
paragraphs = soup.find_all('p')

# Step 4: Print the extracted paragraph data
for p in paragraphs:
    print(p.text)