Python for data science (3150713)
Chapter 3: Getting Your Hands Dirty With Data
Q-1. List different IDEs for Python. Explain the advantages and
disadvantages of each.
1. PyCharm
• Description: A dedicated Python IDE from JetBrains, available in
Community (free) and Professional (paid) editions.
Advantages:
• Smart Code Assistance: Advanced code completion, error detection, and
code refactoring.
• Debugger: Robust debugging capabilities with breakpoints, variable
watching, and step-through execution.
• Integrated Tools: Support for version control systems (e.g., Git), Docker,
and databases.
• Professional Edition: Supports web development frameworks like Django
and Flask.
Disadvantages:
• Heavy Resource Usage: High memory and CPU usage can slow down
performance, especially for large projects.
• Steep Learning Curve: Can be overwhelming for beginners.
• Professional Edition is Paid: The free Community edition lacks some
advanced features.
2. Visual Studio Code (VS Code)
• Description: A lightweight, open-source code editor from Microsoft, with
Python support via extensions.
Advantages:
• Extensibility: Highly customizable through extensions, including Python,
Git, Docker, and more.
• Integrated Debugging: Offers built-in debugging tools for Python.
• Performance: Lightweight and fast compared to full-fledged IDEs.
• Cross-platform: Runs on Windows, macOS, and Linux.
Disadvantages:
• Requires Configuration: Needs extensions to achieve full Python
functionality, which may take some time to set up.
• Can Become Slow: Installing too many extensions can affect performance.
• Not as Feature-Rich: Lacks some of the advanced features found in
dedicated IDEs like PyCharm out of the box.
3. Spyder
• Description: An open-source IDE tailored for data science and scientific
computing, often used in conjunction with packages like NumPy and
pandas.
Advantages:
• Data Science Oriented: Built-in integration with data analysis and
visualization libraries like Matplotlib and pandas.
• Variable Explorer: Allows inspection and manipulation of data variables in
memory, making it great for data analysis.
• Lightweight: Easy to set up and use for quick scripts or data analysis tasks.
Disadvantages:
• Limited Features: Lacks some advanced features found in full-fledged
IDEs (e.g., web development tools).
• Not Suited for Larger Applications: Geared towards scientific computing,
not complex software development.
• Basic UI: The interface feels outdated compared to modern IDEs.
4. Jupyter Notebook / JupyterLab
• Description: A web-based interactive development environment widely
used in data science and machine learning.
Advantages:
• Interactive: Supports live code execution, visualization, and markdown,
making it ideal for iterative development.
• Visualization: Seamless integration with libraries like Matplotlib,
TensorFlow, and pandas for plotting and data analysis.
• Notebooks: Shareable notebooks are great for collaboration and
presentations.
Disadvantages:
• Not for Large Projects: Managing complex projects can be challenging.
• Basic Debugging: Lacks robust debugging and error-checking features.
• Can be Slow: Performance issues can arise with large datasets or long
notebooks.
5. Thonny
• Description: A simple Python IDE designed for beginners, offering an
easy-to-use interface with debugging tools.
Advantages:
• Beginner-Friendly: Designed for learners with a simple interface and a
focus on ease of use.
• Integrated Debugger: Step-through debugging with visual representations
of variables and expressions.
• Lightweight: Small and fast, ideal for learning Python without complex
configurations.
Disadvantages:
• Limited Features: Lacks advanced tools and features needed for large or
complex projects.
• Not Customizable: Fewer customization and extension options compared
to more advanced IDEs.
• Not Suitable for Advanced Users: Lacks support for advanced libraries and
web frameworks.
6. IDLE
• Description: The default IDE that comes bundled with Python, offering a
basic environment for simple scripts.
Advantages:
• Pre-installed: Comes with Python, so no need to install separately.
• Simple Interface: Very easy to use, especially for beginners.
• Lightweight: Minimalist design with fast startup and execution.
Disadvantages:
• Lack of Features: Very basic in terms of functionality, lacks features like
code analysis, project management, and integration with version control.
• Not for Large Projects: Unsuitable for complex or large-scale applications.
• Outdated UI: The user interface is quite dated compared to modern IDEs.
7. Wing IDE
• Description: A professional-grade Python IDE focused on providing
advanced debugging and development tools.
Advantages:
• Powerful Debugger: Provides in-depth debugging tools with breakpoints,
call stacks, and variable inspection.
• Professional Features: Designed for professional development, with
support for unit testing, code refactoring, and Django integration.
• Customizable: Extensive options to customize the development
environment.
Disadvantages:
• Paid Software: Free version is limited in features, and the full version is
paid.
• Resource-Intensive: Can be slow, especially when working with large
projects.
• Steeper Learning Curve: More complex to use for beginners.
Q-2. Write a short note on Jupyter notebooks.
Jupyter Notebooks are an open-source, web-based platform that allows users to
create and share documents that combine live code, visualizations, and narrative
text.
Primarily used in data science, machine learning, and scientific research, they
support multiple programming languages like Python, R, and Julia.
Jupyter Notebooks offer an interactive environment where users can write and
execute code in cells, display results, and document their workflows using
Markdown text.
This makes them ideal for data exploration, analysis, visualization, and sharing
reproducible research. Notebooks can be easily shared or exported in various
formats, facilitating collaboration and communication.
Key Features:
1. Interactive coding: Write and execute code in cells.
2. Multi-language support: Supports over 40 programming languages, including
Python, R, Julia, and MATLAB.
3. Rich text editing: Include LaTeX equations, images, and HTML content.
4. Visualization: Integrate plots, charts, and graphs from popular libraries like
Matplotlib and Seaborn.
5. Collaboration: Share notebooks via URL or export to various formats (e.g.,
PDF, HTML, Markdown).
6. Extensive libraries: Access thousands of libraries and tools, including NumPy,
Pandas, and Scikit-learn.
Benefits:
1. Data exploration and visualization
2. Prototyping and testing
3. Education and research
4. Collaboration and communication
5. Reproducibility and transparency
Applications:
1. Data science and analytics
2. Machine learning and AI
3. Scientific computing and research
4. Education and teaching
5. Business intelligence and reporting
Q-3. Explain Basic IO operations in Python.
Basic I/O (Input/Output) operations in Python involve reading from input sources
like the keyboard and writing output to destinations like the console or a file.
Here’s a breakdown of these operations:
1. Input Operations:
Python provides the input() function to get input from users.
Syntax:
user_input = input("Prompt message: ")
• Explanation: The input() function displays the given prompt and waits for
the user to type something. It reads the user input as a string.
Example:
name = input("Enter your name: ")
print("Hello, " + name)
If the user types "Alice," the output will be:
Hello, Alice
For numeric input, you can convert the string to an integer or float:
age = int(input("Enter your age: "))
height = float(input("Enter your height: "))
2. Output Operations:
The print() function is used to output data to the console.
Syntax:
print(*objects, sep=' ', end='\n')
• objects: One or more expressions you want to print.
• sep: Optional. Specifies a separator between the objects (default is a space).
• end: Optional. Specifies what to print at the end (default is a newline \n).
Example:
print("Hello, world!")
Output:
Hello, world!
You can print multiple objects and customize the separator and end character:
print("Hello", "World", sep='-', end='!')
Output:
Hello-World!
3. File I/O Operations:
Python also allows reading from and writing to files using built-in functions.
Opening a File:
You can open a file using the open() function.
file = open('filename', mode)
• filename: Name of the file to be opened.
• mode: Specifies the mode for opening the file (read, write, append, etc.).
Common modes:
• 'r': Read (default mode, file must exist).
• 'w': Write (creates a new file or overwrites if it exists).
• 'a': Append (adds content to the end of the file).
• 'r+': Read and write.
Writing to a File:
To write data to a file, use the write() or writelines() function.
file = open('example.txt', 'w')
file.write("Hello, world!")
file.close()
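The writelines() method mentioned above writes a sequence of strings in one call; a minimal sketch (note that writelines() does not add newline characters itself):
lines = ["First line\n", "Second line\n", "Third line\n"]
file = open('example.txt', 'w')
file.writelines(lines)
file.close()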
Reading from a File:
You can read the contents of a file using the read(), readline(), or readlines()
functions.
file = open('example.txt', 'r')
content = file.read()
print(content)
file.close()
Best Practice: Use with Statement:
The with statement ensures that the file is properly closed after its operations are
done.
with open('example.txt', 'r') as file:
    content = file.read()
    print(content)
Q-4. Write a short note on Data Conditioning.
Data Conditioning: Preparing Data for Analysis
Data conditioning is the process of preparing raw data for analysis by cleaning,
transforming, and formatting it to improve its quality, consistency, and reliability.
The goal is to ensure data accuracy, completeness, and relevance for statistical
modeling, machine learning, or data visualization.
Key Steps in Data Conditioning:
1. Data Cleaning: Identify and correct errors, handle missing values, and remove
duplicates or irrelevant data.
2. Data Transformation: Convert data types (e.g., text to numerical),
scale/normalize values, and perform feature engineering.
3. Data Integration: Combine data from multiple sources, handle inconsistencies,
and perform data merging.
4. Data Reduction: Select relevant features, reduce dimensionality, and remove
noisy or redundant data.
5. Data Formatting: Reorganize data structures, handle date/time formats, and
ensure data consistency.
Techniques Used:
1. Handling missing values (imputation, interpolation)
2. Data normalization (min-max scaling, standardization)
3. Feature scaling (log transformation, standardization)
4. Encoding categorical variables (one-hot encoding, label encoding)
5. Data aggregation (grouping, summarization)
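A minimal, self-contained Pandas sketch of three of these techniques (mean imputation, min-max scaling, and one-hot encoding) on made-up data; the column names are purely illustrative:
import pandas as pd

# Made-up data with a missing value and a categorical column
df = pd.DataFrame({'age': [25, None, 42],
                   'income': [30000, 52000, 71000],
                   'city': ['Pune', 'Delhi', 'Pune']})

# Imputation: fill the missing age with the column mean
df['age'] = df['age'].fillna(df['age'].mean())

# Min-max scaling: rescale income to the [0, 1] range
df['income'] = (df['income'] - df['income'].min()) / (df['income'].max() - df['income'].min())

# One-hot encoding: expand the categorical column into indicator columns
df = pd.get_dummies(df, columns=['city'])
print(df)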
Benefits:
1. Improved data quality and accuracy
2. Enhanced model performance and reliability
3. Better data visualization and insights
4. Reduced errors and biases
5. Increased efficiency in data analysis
Tools and Technologies:
1. Pandas and NumPy (Python libraries)
2. R programming language
3. Data preprocessing tools (e.g., OpenRefine, Trifacta)
4. Machine learning frameworks (e.g., scikit-learn, TensorFlow)
Q-5. Write a short note on Data Shaping.
Data Shaping: Reshaping Data for Analysis
Data shaping is the process of transforming and restructuring data from its raw
format into a suitable format for analysis, visualization, or modeling. It involves
rearranging data to facilitate insights, improve data quality, and enhance
compatibility with various tools and techniques.
Types of Data Shaping:
1. Pivoting: Rotating data from long to wide format or vice versa.
2. Merging: Combining data from multiple sources.
3. Aggregating: Grouping data and calculating summaries.
4. Reshaping: Changing data structure (e.g., from flat to hierarchical).
5. Transposing: Swapping rows and columns.
Data Shaping Techniques:
1. Data stacking: Combining multiple datasets.
2. Data unstacking: Separating stacked data.
3. Data melting: Converting wide data to long format.
4. Data casting: Converting data types.
Benefits:
1. Improved data readability
2. Enhanced data analysis capabilities
3. Better data visualization
4. Increased compatibility with tools and models
5. Simplified data management
By applying data shaping techniques, data analysts and scientists can transform
raw data into a meaningful, analysis-ready format, unlocking insights and
informing business decisions.
Some examples of data shaping functions in Python using the Pandas library:
import pandas as pd
# Create a sample dataframe
data = {'Name': ['John', 'Mary', 'David'],
'Age': [25, 31, 42],
'Country': ['USA', 'Canada', 'UK']}
df = pd.DataFrame(data)
# Pivot data
pivoted_df = pd.pivot_table(df, values='Age', index='Name',
columns='Country')
# Merge data
df1 = pd.DataFrame({'Name': ['John', 'Mary'], 'Gender': ['Male', 'Female']})
merged_df = pd.merge(df, df1, on='Name')
# Reshape data
reshaped_df = pd.melt(df, id_vars='Name', value_vars='Age')
print(pivoted_df)
print(merged_df)
print(reshaped_df)
Output:-
Country  Canada    UK   USA
Name
David       NaN  42.0   NaN
John        NaN   NaN  25.0
Mary       31.0   NaN   NaN

   Name  Age Country  Gender
0  John   25     USA    Male
1  Mary   31  Canada  Female

    Name variable  value
0   John      Age     25
1   Mary      Age     31
2  David      Age     42
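The stacking, unstacking, and transposing operations listed above can be sketched the same way; a small self-contained example with illustrative values:
import pandas as pd

df = pd.DataFrame({'Name': ['John', 'Mary'], 'Age': [25, 31]})

# Stacking: pivot the columns into the inner level of a row MultiIndex
stacked = df.stack()
print(stacked)

# Unstacking: the inverse operation restores the original shape
print(stacked.unstack())

# Transposing: swap rows and columns
print(df.T)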
Q-6. Differentiate Numpy and Pandas
• Data type: Pandas is preferred when we have to work on tabular data, whereas NumPy is preferred when we have to work on numerical data.
• Core objects: The powerful tools of Pandas are the DataFrame and the Series, whereas the powerful tool of NumPy is the array.
• Memory: Pandas consumes more memory; NumPy is memory efficient.
• Performance: Pandas performs better when the number of rows is 500K or more; NumPy performs better when the number of rows is 50K or less.
• Indexing: Indexing of a Pandas Series is very slow compared to NumPy arrays; indexing of NumPy arrays is very fast.
• Dimensionality: Pandas provides a 2D table object called the DataFrame, whereas NumPy is capable of providing multi-dimensional arrays.
• History: Pandas was developed by Wes McKinney and released in 2008; NumPy was developed by Travis Oliphant and released in 2005.
• Adoption: Pandas is used in a lot of organizations like Kaidee, Trivago, Abeja Inc., and many more; NumPy is used in organizations like Walmart, Tokopedia, Instacart, and many more.
• Industry application: Pandas has a higher industry application; NumPy has a lower industry application.
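A tiny sketch of the contrast: the same data held as a NumPy array (positional access) and as a Pandas DataFrame (labeled access); the labels are illustrative:
import numpy as np
import pandas as pd

# NumPy: a plain numerical 2D array, indexed by position
arr = np.array([[25, 30000], [31, 52000]])
print(arr[0, 1])

# Pandas: the same data with labeled rows and columns
df = pd.DataFrame(arr, columns=['Age', 'Income'], index=['John', 'Mary'])
print(df.loc['John', 'Income'])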
Q-7. Explain Numpy Array with example.
NumPy stands for Numerical Python. It is a Python library used for working with
arrays. In Python, lists can serve as arrays, but they are slow to process. The
NumPy array is a powerful N-dimensional array object used in linear algebra,
Fourier transforms, and random number capabilities. It provides an array object
that is much faster than traditional Python lists.
Types of Array:
1. One Dimensional Array
2. Multi-Dimensional Array
One Dimensional Array:
A one-dimensional array is a type of linear array.
# importing numpy module
import numpy as np
# creating a list
my_list = [1, 2, 3, 4]
# creating a numpy array from the list
sample_array = np.array(my_list)
print("List in python :", my_list)
print("Numpy Array in python :", sample_array)
Output:-
List in python : [1, 2, 3, 4]
Numpy Array in python : [1 2 3 4]
Multi-Dimensional Array:
Data in multidimensional arrays are stored in tabular form.
# importing numpy module
import numpy as np
# creating list
list_1 = [1, 2, 3, 4]
list_2 = [5, 6, 7, 8]
list_3 = [9, 10, 11, 12]
# creating numpy array
sample_array = np.array([list_1,
list_2,
list_3])
print("Numpy multi dimensional array in python\n",
sample_array)
Output:-
Numpy multi dimensional array in python
[[ 1 2 3 4]
[ 5 6 7 8]
[ 9 10 11 12]]
Q-8. Differentiate the rand and randn functions in NumPy.
• Purpose: rand generates random numbers from a uniform distribution; randn generates random numbers from a standard normal distribution.
• Distribution type: rand draws from a uniform distribution over the interval [0.0, 1.0); randn draws from the standard normal distribution (mean = 0, standard deviation = 1).
• Range of values: rand returns values in [0.0, 1.0); randn can return any value in (-∞, ∞).
• Function signature: numpy.random.rand(d0, d1, ..., dn) and numpy.random.randn(d0, d1, ..., dn).
• Typical use cases: rand is used for random sampling and generating random floats; randn is used for generating data for statistical modeling or simulations.
• Output shape: for both functions, the output shape is specified by the input dimensions.
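A minimal sketch contrasting the two functions (the printed values differ on every run because both are random):
import numpy as np

# rand: samples uniformly from [0.0, 1.0)
uniform_sample = np.random.rand(2, 3)
print(uniform_sample)

# randn: samples from the standard normal distribution (mean 0, std 1)
normal_sample = np.random.randn(2, 3)
print(normal_sample)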
Q-9. List and Explain Numpy Aggregation functions with example.
1. numpy.sum(): Computes the sum of elements.
2. numpy.mean(): Computes the average of elements.
3. numpy.median(): Finds the median value.
4. numpy.std(): Calculates the standard deviation.
5. numpy.var(): Computes the variance.
6. numpy.min(): Finds the minimum value.
7. numpy.max(): Finds the maximum value.
1. numpy.sum()
Calculates the sum of array elements along a specified axis.
import numpy as np
array = np.array([[1, 2, 3], [4, 5, 6]])
# Sum of all elements
total_sum = np.sum(array)
print("Total Sum:", total_sum)
# Sum along rows (axis=0)
sum_rows = np.sum(array, axis=0)
print("Sum along rows:", sum_rows)
# Sum along columns (axis=1)
sum_columns = np.sum(array, axis=1)
print("Sum along columns:", sum_columns)
Output:-
Total Sum: 21
Sum along rows: [5 7 9]
Sum along columns: [ 6 15]
2. numpy.mean()
Calculates the arithmetic mean of array elements along a specified axis.
mean_value = np.mean(array)
print("Mean:", mean_value)
mean_rows = np.mean(array, axis=0)
print("Mean along rows:", mean_rows)
mean_columns = np.mean(array, axis=1)
print("Mean along columns:", mean_columns)
Output:-
Mean: 3.5
Mean along rows: [2.5 3.5 4.5]
Mean along columns: [2. 5.]
3. numpy.median()
Calculates the median of array elements along a specified axis.
median_value = np.median(array)
print("Median:", median_value)
median_rows = np.median(array, axis=0)
print("Median along rows:", median_rows)
median_columns = np.median(array, axis=1)
print("Median along columns:", median_columns)
Output:-
Median: 3.5
Median along rows: [2.5 3.5 4.5]
Median along columns: [2. 5.]
4. numpy.std()
Calculates the standard deviation of array elements along a specified axis.
std_dev = np.std(array)
print("Standard Deviation:", std_dev)
std_dev_rows = np.std(array, axis=0)
print("Standard Deviation along rows:", std_dev_rows)
std_dev_columns = np.std(array, axis=1)
print("Standard Deviation along columns:", std_dev_columns)
Output:-
Standard Deviation: 1.707825127659933
Standard Deviation along rows: [1.5 1.5 1.5]
Standard Deviation along columns: [0.81649658 0.81649658]
5. numpy.var()
Calculates the variance of array elements along a specified axis.
variance = np.var(array)
print("Variance:", variance)
variance_rows = np.var(array, axis=0)
print("Variance along rows:", variance_rows)
variance_columns = np.var(array, axis=1)
print("Variance along columns:", variance_columns)
Output:-
Variance: 2.9166666666666665
Variance along rows: [2.25 2.25 2.25]
Variance along columns: [0.66666667 0.66666667]
6. numpy.min()
Finds the minimum value in the array.
min_value = np.min(array)
print("Minimum Value:", min_value)
min_rows = np.min(array, axis=0)
print("Minimum along rows:", min_rows)
min_columns = np.min(array, axis=1)
print("Minimum along columns:", min_columns)
Output:-
Minimum Value: 1
Minimum along rows: [1 2 3]
Minimum along columns: [1 4]
7. numpy.max()
Finds the maximum value in the array.
max_value = np.max(array)
print("Maximum Value:", max_value)
max_rows = np.max(array, axis=0)
print("Maximum along rows:", max_rows)
max_columns = np.max(array, axis=1)
print("Maximum along columns:", max_columns)
Output:-
Maximum Value: 6
Maximum along rows: [4 5 6]
Maximum along columns: [3 6]
Q-10. Explain Series in Pandas with example.
A Pandas Series can be defined as a one-dimensional array that is capable of
storing various data types. We can easily convert a list, tuple, or dictionary
into a Series using the Series() constructor. The row labels of a Series are
called the index. A Series cannot contain multiple columns. It has the
following parameters:
1. data: It can be any list, dictionary, or scalar value.
2. index: The values of the index should be unique and hashable, and it must
be the same length as data. If we do not pass any index, the default
np.arange(n) will be used.
3. dtype: It refers to the data type of series.
4. copy: It is used for copying the data.
Creating a Pandas Series
In the real world, a Pandas Series will be created by loading the datasets from
existing storage, storage can be SQL Database, CSV file, and Excel file. Pandas
Series can be created from the lists, dictionary, and from a scalar value etc. Series
can be created in different ways, here are some ways by which we create a series:
Creating a Series from an array: In order to create a Series from an array, we
have to import the NumPy module and use its array() function.
import pandas as pd
import numpy as np

data = np.array([1, 7, 2])
myvar = pd.Series(data)
print(myvar)
Output:-
0    1
1    7
2    2
dtype: int64
Create a Series from list:
import pandas as pd
# a simple list
my_list = ['g', 'e', 'e', 'k', 's']
# create a Series from the list
ser = pd.Series(my_list)
print(ser)
Output:-
0    g
1    e
2    e
3    k
4    s
dtype: object
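The text above also mentions creating a Series from a dictionary and from a scalar value; a minimal sketch with illustrative keys and values:
import pandas as pd

# From a dictionary: the keys become the index
calories = {"day1": 420, "day2": 380, "day3": 390}
ser = pd.Series(calories)
print(ser)

# From a scalar: the value is repeated for each index label
ser2 = pd.Series(5, index=['a', 'b', 'c'])
print(ser2)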
Q-11. Explain DataFrame in Pandas with example.
A Pandas DataFrame is a two-dimensional, size-mutable, potentially
heterogeneous tabular data structure with labeled axes (rows and columns).
That is, data is aligned in a tabular fashion in rows and columns. A Pandas
DataFrame consists of three principal components: the data, the rows, and the
columns.
Create a DataFrame
We can create a DataFrame in the following ways:
o dict
o Lists
o NumPy ndarrays
o Series
Create a DataFrame from Dict of Series:
# importing the pandas library
import pandas as pd
info = {'one' : pd.Series([1, 2, 3, 4, 5, 6], index=['a', 'b', 'c', 'd', 'e', 'f']),
'two' : pd.Series([1, 2, 3, 4, 5, 6, 7, 8], index=['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])}
d1 = pd.DataFrame(info)
print (d1)
Output:-
one two
a 1.0 1
b 2.0 2
c 3.0 3
d 4.0 4
e 5.0 5
f 6.0 6
g NaN 7
h NaN 8
Create a DataFrame using List:
# importing the pandas library
import pandas as pd
# a list of strings
x = ['Python', 'Pandas', 'Java', 'PHP']
# Calling DataFrame constructor on list
df = pd.DataFrame(x)
print(df)
Output:-
        0
0  Python
1  Pandas
2    Java
3     PHP
Creating DataFrame from dict of ndarray/lists:
import pandas as pd
# initialise data of lists.
data = {'Name':['Tom', 'nick', 'krish', 'jack'],
'Age':[20, 21, 19, 18]}
# Create DataFrame
df = pd.DataFrame(data)
# Print the output.
print(df)
Output:-
    Name  Age
0    Tom   20
1   nick   21
2  krish   19
3   jack   18
Operations on Series:
# Performing arithmetic operations
series_a = pd.Series([1, 2, 3])
series_b = pd.Series([4, 5, 6])
sum_series = series_a + series_b
print("\nSum of Two Series:")
print(sum_series)
Output:-
Sum of Two Series:
0 5
1 7
2 9
dtype: int64
Q-12. Explain Multi-Index DataFrame in Pandas with example.
Multi-index in Python Pandas
Multi-index allows you to select more than one row and column in your index.
It is a multi-level or hierarchical index object for Pandas objects.
We can use various methods of multi-index such as MultiIndex.from_arrays(),
MultiIndex.from_tuples(), MultiIndex.from_product(), MultiIndex.from_frame,
etc., which helps us to create multiple indexes from arrays, tuples, DataFrame,
etc.
Example 1: Creating multi-index from arrays
After importing all the important Python libraries, we are creating an array of
names along with arrays of marks and age respectively.
Now with the help of MultiIndex.from_arrays, we are combining all three arrays
such that elements from all three arrays form multiple indexes together. After that,
we show the above result.
# importing pandas library
import pandas as pd
arrays = ['Sohom','Suresh','kumkum','subrata']
age= [10, 11, 12, 13]
marks=[90,92,23,64]
multi_index= pd.MultiIndex.from_arrays([arrays,age,marks], names=('names',
'age','marks'))
print(multi_index)
Output:-
MultiIndex([( 'Sohom', 10, 90),
            ('Suresh', 11, 92),
            ('kumkum', 12, 23),
            ('subrata', 13, 64)],
           names=['names', 'age', 'marks'])
Example 2: Creating multi-index from DataFrame using Pandas.
In this example, we are doing the same thing as the previous example: we create
a DataFrame using pd.DataFrame and then build a multi-index from that
DataFrame using MultiIndex.from_frame() along with the column names.
import pandas as pd
# Creating data
Information = {'name': ["Saikat", "Shrestha", "Sandi", "Abinash"],
'Jobs': ["Software Developer", "System Engineer",
"Footballer", "Singer"],
'Annual Salary(L.P.A)': [12.4, 5.6, 9.3, 10]}
# Creating a DataFrame from the whole data
df = pd.DataFrame(Information)
# Creating the multi-index from the DataFrame
multi_index = pd.MultiIndex.from_frame(df)
print(multi_index)
Output:-
MultiIndex([(  'Saikat', 'Software Developer', 12.4),
            ('Shrestha',    'System Engineer',  5.6),
            (   'Sandi',         'Footballer',  9.3),
            ( 'Abinash',             'Singer', 10.0)],
           names=['name', 'Jobs', 'Annual Salary(L.P.A)'])
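The other constructors listed earlier work similarly; a minimal sketch of MultiIndex.from_product(), which builds an index from every combination of the given iterables (the year/quarter values are illustrative):
import pandas as pd

years = [2023, 2024]
quarters = ['Q1', 'Q2']

# Every (year, quarter) combination becomes one index entry
multi_index = pd.MultiIndex.from_product([years, quarters], names=('year', 'quarter'))
print(multi_index)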
Q-13. Explain Cross Section in DataFrame with Example.
Sometimes we need to take a cross-section of a Pandas Series or DataFrame.
Here, cross-section means getting values at a specified index, at several
indexes, at several indexes and levels, or at a specified column and axis.
The function pandas.DataFrame.xs() helps in this situation.
pandas.DataFrame.xs() takes a key argument in order to select data at the
particular level in MultiIndex and returns cross-section from pandas data frame.
• Syntax: DataFrame.xs(key, axis=0, level=None, drop_level=True)
Parameters:
• key – Label contained in the index, or partially in a MultiIndex.
• axis – Axis to retrieve cross-section on.
• level – In case of a key partially contained in a MultiIndex, indicate which
levels are used.
• drop_level – If False, returns an object with the same levels as self.
Returns:
• Cross-section from the original DataFrame
# importing pandas library
import pandas as pd
# Creating a Dictionary
animal_dict = {'num_of_legs': [4, 0, 4, 2, 2, 2],
'num_of_wings': [0, 0, 0, 2, 2, 2],
'class': ['Reptiles', 'Reptiles', 'Reptiles',
'Birds', 'Birds', 'Birds'],
'animal': ['Turtle', 'Snake', 'Crocodile',
'Parrot', 'Owl', 'Hummingbird'],
'locomotion': ['swim_walk', 'swim_crawl', 'swim_walk',
'flies', 'flies', 'flies']}
# Converting to Data frame and setting index
df = pd.DataFrame(data=animal_dict)
df = df.set_index(['class', 'animal', 'locomotion'])
# Displaying the DataFrame
print(df)
Output:-
                                 num_of_legs  num_of_wings
class    animal      locomotion
Reptiles Turtle      swim_walk             4             0
         Snake       swim_crawl            0             0
         Crocodile   swim_walk             4             0
Birds    Parrot      flies                 2             2
         Owl         flies                 2             2
         Hummingbird flies                 2             2
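The code above only builds the MultiIndex; a short sketch of actually calling xs() on the df constructed above:
# Cross-section at key 'Birds' on the outer ('class') level:
# returns all bird rows, with the 'class' level dropped
print(df.xs('Birds'))

# Cross-section at an inner level, selected by level name
print(df.xs('flies', level='locomotion'))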
Q-14. Explain how to deal with missing data in Pandas.
Missing data can occur when no information is provided for one or more items
or for a whole unit. Missing data is a very big problem in real-life scenarios.
Missing data is also referred to as NA (Not Available) values in Pandas. Many
datasets simply arrive with missing data, either because it exists and was not
collected or because it never existed. For example, some users being surveyed
may choose not to share their income, and others may choose not to share their
address; in this way, many datasets end up with missing values.
Pandas treats None and NaN as essentially interchangeable for indicating
missing or null values. To facilitate this convention, there are several useful
functions for detecting, removing, and replacing null values in a Pandas
DataFrame:
• isnull()
• notnull()
• dropna()
• fillna()
• replace()
• interpolate()
1. pandas.isnull() Method
# importing pandas package
import pandas as pd
# making data frame from csv file
data = pd.read_csv("/content/employees.csv")
# creating bool series True for NaN values
bool_series = pd.isnull(data["Team"])
# filtering data
# displaying data only with team = NaN
data[bool_series]
Output: the rows of employees.csv where Team is NaN.
2. notnull() Method
# importing pandas package
import pandas as pd
# making data frame from csv file
data = pd.read_csv("/content/employees.csv")
# creating bool series False for NaN values
bool_series = pd.notnull(data["Gender"])
# displaying data only where Gender is not NaN
data[bool_series]
Output: the rows of employees.csv where Gender is not NaN.
3. dropna() Example
# importing pandas module
import pandas as pd
# making data frame from csv file
data = pd.read_csv("nba.csv")
# making new data frame with dropped NA values
new_data = data.dropna(axis=0, how='any')
# comparing sizes of data frames
print("Old data frame length:", len(data),
"\nNew data frame length:",
len(new_data),
"\nNumber of rows with at least 1 NA value: ",
(len(data)-len(new_data)))
Output:-
Old data frame length: 458
New data frame length: 364
Number of rows with at least 1 NA value: 94
4. replace() Example
import pandas as pd
values = {
    "Array_1": [49.50, 70],
    "Array_2": [65.1, 49.50]
}
data = pd.DataFrame(values)
print(data.replace(49.50, 60))
Output:-
   Array_1  Array_2
0     60.0     65.1
1     70.0     60.0
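The fillna() and interpolate() functions listed earlier were not shown above; a minimal sketch on a made-up Series:
import pandas as pd
import numpy as np

s = pd.Series([1.0, np.nan, 3.0, np.nan, 5.0])

# fillna(): replace NaN with a fixed value
print(s.fillna(0))

# interpolate(): estimate NaN values from their neighbours (linear by default)
print(s.interpolate())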
Q-15. Explain Groupby function in pandas with example.
The groupby function in Pandas is used to group data by one or more columns
and perform aggregation operations.
Basic Syntax:-
df.groupby(by=column_name)[column_to_aggregate].aggregate_function()
Example 1: Grouping by Single Column
Suppose we have a DataFrame df with columns Name, Age, and Sales.
import pandas as pd
# Create DataFrame
data = {'Name': ['John', 'Mary', 'John', 'Mary', 'David'],
'Age': [25, 31, 25, 31, 42],
'Sales': [100, 200, 300, 400, 500]}
df = pd.DataFrame(data)
# Group by Name and calculate total Sales
grouped_df = df.groupby('Name')['Sales'].sum()
print(grouped_df)
Output:-
Name
David    500
John     400
Mary     600
Name: Sales, dtype: int64
Example 2: Grouping by Multiple Columns
# Group by Name and Age, and calculate mean Sales
grouped_df = df.groupby(['Name', 'Age'])['Sales'].mean()
print(grouped_df)
Output:-
Name   Age
David  42     500.0
John   25     200.0
Mary   31     300.0
Name: Sales, dtype: float64
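groupby() can also apply several aggregations at once through agg(); a minimal sketch that rebuilds the same df used above:
import pandas as pd

df = pd.DataFrame({'Name': ['John', 'Mary', 'John', 'Mary', 'David'],
                   'Sales': [100, 200, 300, 400, 500]})

# Compute sum, mean, and count of Sales per Name in one pass
summary = df.groupby('Name')['Sales'].agg(['sum', 'mean', 'count'])
print(summary)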
Q-16. Explain join function in pandas with example.
The join() method joins the columns of another DataFrame or Series onto the calling DataFrame.
Syntax
dataframe.join(other, on, how, lsuffix, rsuffix, sort)
import pandas as pd
data1 = {
"name": ["Sally", "Mary", "John"],
"age": [50, 40, 30]
}
data2 = {
"qualified": [True, False, False]
}
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
newdf = df1.join(df2)
print(newdf)
Output:-
    name  age  qualified
0  Sally   50       True
1   Mary   40      False
2   John   30      False
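When both DataFrames share column names, the lsuffix and rsuffix parameters from the syntax above disambiguate them; a minimal sketch with illustrative data:
import pandas as pd

df1 = pd.DataFrame({'name': ['Sally', 'Mary'], 'score': [80, 90]})
df2 = pd.DataFrame({'name': ['Anna', 'Ben'], 'score': [70, 60]})

# Both frames have 'name' and 'score'; the suffixes keep them apart
joined = df1.join(df2, lsuffix='_left', rsuffix='_right')
print(joined)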
Q-17. Explain merge function in pandas with example.
The merge operation in Pandas merges two DataFrames based on their indexes
or a specified column.
The merge() in Pandas works similar to JOINs in SQL.
The syntax of the merge() method in Pandas is:
pd.merge(left, right, on=None, how='inner', left_on=None, right_on=None,
sort=False)
import pandas as pd
# create dataframes from the dictionaries
data1 = {
'EmployeeID' : ['E001', 'E002', 'E003', 'E004', 'E005'],
'Name' : ['John Doe', 'Jane Smith', 'Peter Brown', 'Tom Johnson', 'Rita Patel'],
'DeptID': ['D001', 'D003', 'D001', 'D002', 'D003'],
}
employees = pd.DataFrame(data1)
data2 = {
'DeptID': ['D001', 'D002', 'D003'],
'DeptName': ['Sales', 'HR', 'Admin']
}
departments = pd.DataFrame(data2)
# merge dataframes employees and departments
merged_df = pd.merge(employees, departments)
# display DataFrames
print("Employees:")
print(employees)
print()
print("Departments:")
print(departments)
print()
print("Merged DataFrame:")
print(merged_df)
Output:-
Employees:
EmployeeID Name DeptID
0 E001 John Doe D001
1 E002 Jane Smith D003
2 E003 Peter Brown D001
3 E004 Tom Johnson D002
4 E005 Rita Patel D003
Departments:
DeptID DeptName
0 D001 Sales
1 D002 HR
2 D003 Admin
Merged DataFrame:
EmployeeID Name DeptID DeptName
0 E001 John Doe D001 Sales
1 E003 Peter Brown D001 Sales
2 E002 Jane Smith D003 Admin
3 E005 Rita Patel D003 Admin
4 E004 Tom Johnson D002 HR
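The how, left_on, and right_on parameters from the syntax above can be sketched briefly; the column names here are illustrative:
import pandas as pd

left = pd.DataFrame({'emp': ['E001', 'E002', 'E006'],
                     'dept_id': ['D001', 'D003', 'D009']})
right = pd.DataFrame({'id': ['D001', 'D002', 'D003'],
                      'dept_name': ['Sales', 'HR', 'Admin']})

# Left join on differently named key columns:
# unmatched rows from the left frame are kept, with NaN filled in
out = pd.merge(left, right, how='left', left_on='dept_id', right_on='id')
print(out)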
Q-18. Differentiate the join and merge functions in Pandas.
• Primary use: join() combines DataFrames based on their index; merge() combines DataFrames based on columns or index.
• Default behavior: join() performs a left join by default; merge() performs an inner join by default.
• Join on columns: join() works primarily with the DataFrame index (you can join on columns but need to reset the index); merge() lets you specify columns or index to merge on, providing more flexibility.
• Join types: join() supports left, right, inner, and outer joins; merge() supports left, right, inner, outer, and cross joins.
• Flexibility: join() is less flexible for joining on columns and is ideal for joining by index; merge() is more flexible, since you can specify multiple columns or a combination of columns and index to merge on.
• Syntax: join() is simpler for joining on index; merge() is more versatile, as it allows joining on columns or indices.
• Key parameters: join() takes on=None, which defaults to the DataFrame index; merge() takes on, left_on, and right_on, which can be columns or indices.
• Handling overlapping column names: join() uses lsuffix and rsuffix to differentiate overlapping column names in the two DataFrames; merge() lets you specify suffixes to handle overlapping column names.
Q-19. Explain read_csv function in pandas with example.
CSV files are Comma-Separated Values files. To access data from a CSV file,
we use the read_csv() function from Pandas, which retrieves the data in the
form of a DataFrame.
Syntax of Pandas read_csv
Here is the Pandas read CSV syntax with its parameters.
Syntax:
pd.read_csv(filepath_or_buffer, sep=',', header='infer', index_col=None,
usecols=None, engine=None, skiprows=None, nrows=None)
Read CSV File using Pandas read_csv
import pandas as pd
df = pd.read_csv('data.csv')
print(df)
Output: the contents of data.csv displayed as a DataFrame.
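Since data.csv itself is not shown, here is a self-contained sketch that first creates a small CSV file and then reads it back, exercising the usecols and nrows parameters from the syntax above:
import pandas as pd

# Create a small CSV file for the demonstration
with open('data.csv', 'w') as f:
    f.write("Name,Age,City\n")
    f.write("Tom,20,Pune\n")
    f.write("Mary,31,Delhi\n")
    f.write("David,42,Mumbai\n")

# Read only two columns and the first two data rows
df = pd.read_csv('data.csv', usecols=['Name', 'Age'], nrows=2)
print(df)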
Q-20. Explain the read_excel function in Pandas with example.
Pandas read_excel()
1. The Pandas module read_excel() function reads the Excel sheet data into
a DataFrame object.
2. An Excel sheet is a two-dimensional table, and the DataFrame object
represents the two-dimensional tabular view.
3. We can get the header details from the Excel sheet.
4. We can also specify the columns to be read from the Excel sheet.
5. The DataFrame object has various useful methods to convert the Excel
data into CSV, Dictionary, or JSON representation.
Pandas read_excel() Example
import pandas
excel_data_df = pandas.read_excel('records.xlsx', sheet_name='Employees')
# print whole sheet data
print(excel_data_df)
Output:-
   EmpID    EmpName  EmpRole
0      1     Pankaj      CEO
1      2  David Lee   Editor
2      3   Lisa Ray   Author
Q-21. Explain Web Scraping with example using the
Beautiful Soup library.
Web Scraping is the process of extracting data from websites. It allows you to
collect and analyze large amounts of data from websites that don’t have an official
API. Python has several libraries for web scraping, and one of the most commonly
used ones is Beautiful Soup.
What is Beautiful Soup?
Beautiful Soup is a Python library that makes it easy to scrape information from
web pages. It allows you to parse HTML and XML documents and extract useful
data from them. It works with a parser, such as the built-in Python parser or third-
party parsers like lxml.
Key Steps in Web Scraping using Beautiful Soup:
1. Send a request to the webpage to retrieve its content.
2. Parse the HTML content of the webpage.
3. Navigate and search the parsed HTML tree to extract the required
information.
4. Extract and store the scraped data.
Example: Web Scraping with Beautiful Soup
Let’s scrape a sample website to extract some data.
Libraries Needed:
• requests: To make a request to the webpage.
• Beautiful Soup: To parse the HTML content.
# Importing necessary libraries
import requests
from bs4 import BeautifulSoup
# Step 1: Send a request to the website
url = "https://example.com" # Replace with the website you want to scrape
response = requests.get(url)
# Step 2: Parse the HTML content with Beautiful Soup
soup = BeautifulSoup(response.text, 'html.parser')
# Step 3: Extract all paragraph tags
paragraphs = soup.find_all('p')

# Step 4: Print the extracted paragraph data
for p in paragraphs:
    print(p.text)