Day64 - Pandas Interview Questions

Pandas is a popular Python library for data manipulation and analysis. It provides two primary data structures: Series for one-dimensional labeled data and DataFrame for two-dimensional labeled data similar to a spreadsheet. A DataFrame supports flexible data analysis through its tabular structure with labeled rows and columns, its ability to hold heterogeneous data types, and built-in functions for cleaning, transforming, exploring and visualizing data. loc and iloc are two indexers for selecting data from a DataFrame: loc selects by label, while iloc selects by integer position.

11/24/23, 11:51 AM Day64 - Pandas Interview Questions

Pandas Interview Questions


1. What is Pandas, and why is it popular in data analysis?
Pandas is a popular open-source data manipulation and analysis library for the Python
programming language. It provides data structures for efficiently storing and manipulating
large datasets and tools for reading and writing data in various formats. The two primary
data structures in Pandas are:

1. Series: A one-dimensional labeled array capable of holding any data type.


2. DataFrame: A two-dimensional labeled data structure with columns that can be of
different types.
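As a minimal sketch of the two structures described above, the snippet below builds one Series and one DataFrame. The names `ages` and `people` and the sample values are assumptions chosen for illustration.

```python
import pandas as pd

# A Series: one-dimensional values, each with a label in the index.
ages = pd.Series([28, 24, 22], index=["John", "Jane", "Bob"], name="Age")

# A DataFrame: several columns, possibly of different types, sharing one row index.
people = pd.DataFrame(
    {"Age": [28, 24, 22], "City": ["New York", "San Francisco", "Los Angeles"]},
    index=["John", "Jane", "Bob"],
)

print(ages["Jane"])         # look up a Series value by its label
print(people.dtypes)        # one dtype per column: numeric Age, text City
print(type(people["Age"]))  # selecting one column returns a Series
```

Selecting a single column from a DataFrame gives back a Series, which is why most Series operations carry over to DataFrame columns.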

Pandas is widely used in the field of data analysis for several reasons:

1. Ease of Use: Pandas provides a simple and intuitive syntax for data manipulation. Its
data structures are designed to be easy to use and interact with.

2. Data Cleaning and Transformation: Pandas makes it easy to clean and transform data.
It provides functions for handling missing data, reshaping data, merging and joining
datasets, and performing various data transformations.

3. Data Exploration: Pandas allows data analysts to explore and understand their datasets
quickly. Descriptive statistics, data summarization, and various methods for slicing and
dicing data are readily available.

4. Data Input/Output: Pandas supports reading and writing data in various formats,
including CSV, Excel, SQL databases, and more. This makes it easy to work with data
from different sources.

5. Integration with Other Libraries: Pandas integrates well with other popular data
science and machine learning libraries in Python, such as NumPy, Matplotlib, and Scikit-
learn. This allows for a seamless workflow when performing more complex analyses.

6. Time Series Analysis: Pandas provides excellent support for time series data, including
tools for date range generation, frequency conversion, and resampling.

7. Community and Documentation: Pandas has a large and active community, which
means there is extensive documentation and a wealth of online resources, tutorials, and
forums available for users to seek help and guidance.

8. Open Source: Being an open-source project, Pandas allows users to contribute to its
development and improvement. This collaborative nature has helped Pandas evolve and
stay relevant in the rapidly changing landscape of data analysis and data science.

In summary, Pandas is popular in data analysis because it simplifies the process of working
with structured data, provides powerful tools for data manipulation, and has become a
standard tool in the Python ecosystem for data analysis tasks.
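A few of the points above (data input, descriptive statistics, and resampling of time series) can be sketched in a handful of lines. The CSV text, the column names `date` and `sales`, and the variable names are hypothetical; an in-memory string stands in for a file on disk.

```python
import io
import pandas as pd

# Hypothetical CSV content; in practice this would be a path passed to read_csv.
csv_text = "date,sales\n2023-01-01,10\n2023-01-02,20\n2023-02-01,30\n"

# Data input: parse the CSV and convert the 'date' column to datetimes.
sales = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"])

# Data exploration: count, mean, std, min, quartiles, and max in one call.
summary = sales["sales"].describe()

# Time series: resample daily rows into month-start ("MS") buckets and sum them.
monthly = sales.set_index("date")["sales"].resample("MS").sum()

print(summary["mean"])
print(monthly)
```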

2. What is DataFrame in Pandas?


In Pandas, a DataFrame is a two-dimensional, tabular data structure with labeled axes (rows
and columns). It is similar to a spreadsheet or SQL table, where data can be stored in rows
and columns. The key features of a DataFrame include:

1. Tabular Structure: A DataFrame is a two-dimensional table with rows and columns.
Each column can have a different data type, such as integer, float, string, or even
custom types.

2. Labeled Axes: Both rows and columns of a DataFrame are labeled. Each row has an index
label and each column has a name, allowing for easy access and manipulation of data.
Labels are usually unique, although Pandas does not require them to be.

3. Flexible Size: DataFrames can grow and shrink in size. You can add or remove rows and
columns as needed.

4. Heterogeneous Data Types: Different columns in a DataFrame can have different data
types. For example, one column might contain integers, while another column contains
strings.

5. Data Alignment: When performing operations on DataFrames, Pandas automatically
aligns the data based on labels, making it easy to work with data even if it is not
perfectly clean or aligned.

6. Missing Data Handling: DataFrames can handle missing data gracefully. Pandas
provides methods for detecting, removing, or filling missing values.

7. Powerful Operations: DataFrames support a wide range of operations, including
arithmetic operations, aggregation, filtering, merging, and reshaping. This makes it a
powerful tool for data analysis and manipulation.

In [ ]: import pandas as pd

# Creating a DataFrame from a dictionary
data = {'Name': ['John', 'Jane', 'Bob'],
        'Age': [28, 24, 22],
        'City': ['New York', 'San Francisco', 'Los Angeles']}

df = pd.DataFrame(data)

# Displaying the DataFrame
print(df)

   Name  Age           City
0  John   28       New York
1  Jane   24  San Francisco
2   Bob   22    Los Angeles

In this example, each column represents a different attribute (Name, Age, City), and each
row represents a different individual. The DataFrame provides a convenient way to work with
this tabular data in a structured and labeled format.
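Two of the features listed earlier, data alignment and flexible size, are easier to see in code. The following is a small sketch; the Series `a` and `b` and the column name `AgeNextYear` are made up for illustration.

```python
import pandas as pd

a = pd.Series([1, 2, 3], index=["x", "y", "z"])
b = pd.Series([10, 20, 30], index=["y", "z", "w"])

# Data alignment: values are matched by label, not by position.
# Labels found in only one Series ("x" and "w") give NaN in the result.
total = a + b
print(total)

# Flexible size: columns can be added and removed after creation.
df = pd.DataFrame({"Name": ["John", "Jane", "Bob"], "Age": [28, 24, 22]})
df["AgeNextYear"] = df["Age"] + 1       # add a derived column
df = df.drop(columns=["AgeNextYear"])   # remove it again
print(df.columns.tolist())
```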

3. What is the difference between loc and iloc in pandas?
In Pandas, loc and iloc are two indexers for selecting data from a DataFrame. They are used
with square brackets rather than called like functions. loc performs label-based indexing
and iloc performs integer-position-based indexing. The key differences are:

1. loc (Label-based Indexing):

The loc indexer is used for selection by label.
It allows you to access a group of rows and columns by labels or a boolean array.
The syntax is df.loc[row_label, column_label], or df.loc[row_label] for
selecting entire rows.
The labels used with loc are the actual index labels or column names, not
integer positions.
Slicing with loc is inclusive: both the start and stop labels are included in the
selection.

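In [ ]: import pandas as pd

# 'df' is the three-row DataFrame created in question 2 (index labels 0, 1, 2)
# Labels 3 and 4 do not exist, so the slice 2:4 returns only the row labeled 2
selected_data = df.loc[2:4, 'Name':'City']
selected_data

Out[ ]:   Name  Age         City
        2  Bob   22  Los Angeles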

2. iloc (Integer-location based Indexing):

The iloc indexer is used for selection by position.
It allows you to access a group of rows and columns by integer positions.
The syntax is df.iloc[row_index, column_index], or df.iloc[row_index]
for selecting entire rows.
The indices used with iloc are integer-based, meaning you specify the position
of the rows and columns based on their numerical order (0-based indexing).
Slicing with iloc is exclusive: the stop position is not included in the
selection.

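In [ ]: import pandas as pd

# 'df' is the three-row DataFrame created in question 2
# Positions 3 and 4 are past the end, so the slice 2:5 returns only position 2
selected_data = df.iloc[2:5, 0:3]
selected_data

Out[ ]:   Name  Age         City
        2  Bob   22  Los Angeles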

In summary, if you want to select data based on the labels of rows and columns, you use
loc . If you prefer to select data based on the integer positions of rows and columns, you
use iloc . The choice between them depends on whether you are working with labeled or
integer-based indexing.
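With a default 0, 1, 2, ... index, labels and positions coincide, which can hide the difference between the two indexers. The sketch below assumes a hypothetical DataFrame whose index labels are 10, 20, 30, 40 so that the two behave visibly differently.

```python
import pandas as pd

df = pd.DataFrame(
    {"Name": ["John", "Jane", "Bob", "Alice"], "Age": [28, 24, 22, 30]},
    index=[10, 20, 30, 40],  # hypothetical non-default row labels
)

# loc slices by label and includes the stop label (30).
by_label = df.loc[10:30, "Name"]

# iloc slices by position and excludes the stop position (2).
by_position = df.iloc[0:2, 0]

print(by_label.tolist())     # three names
print(by_position.tolist())  # two names

# 0 is a valid position but not a label in this index, so loc raises KeyError.
try:
    df.loc[0]
except KeyError:
    print("df.loc[0] failed: 0 is not an index label")
```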

4. How do you filter rows in a DataFrame based on a condition?
To filter rows in a DataFrame based on a condition, you can use boolean indexing. Boolean
indexing involves creating a boolean Series that represents the condition you want to apply
and then using that boolean Series to filter the rows of the DataFrame. Here's a step-by-step
guide:

Assuming you have a DataFrame named df, and you want to filter rows based on a
condition, let's say a condition on the 'Age' column:

In [ ]: import pandas as pd

# Build a sample DataFrame
data = {'Name': ['John', 'Jane', 'Bob', 'Alice'],
        'Age': [28, 24, 22, 30],
        'City': ['New York', 'San Francisco', 'Los Angeles', 'Chicago']}
df = pd.DataFrame(data)

# Condition for filtering (e.g., selecting rows where Age is greater than 25)
# This is a boolean Series with one True/False value per row
condition = df['Age'] > 25

# Applying the condition to filter rows
filtered_df = df[condition]

# Displaying the filtered DataFrame
print(filtered_df)

    Name  Age      City
0   John   28  New York
3  Alice   30   Chicago
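Interview follow-ups often ask how to combine several conditions. The sketch below reuses the same sample data; the variable names are chosen for illustration. The element-wise operators `&`, `|`, and `~` are used instead of Python's `and`, `or`, and `not`, and each condition needs its own parentheses.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["John", "Jane", "Bob", "Alice"],
    "Age": [28, 24, 22, 30],
    "City": ["New York", "San Francisco", "Los Angeles", "Chicago"],
})

# AND: older than 25 and not living in New York.
older_not_ny = df[(df["Age"] > 25) & (df["City"] != "New York")]

# Membership test: City is one of several values.
in_cities = df[df["City"].isin(["Chicago", "Los Angeles"])]

# Filter rows and pick columns in one step with loc and a boolean mask.
young_names = df.loc[df["Age"] < 25, ["Name", "Age"]]

print(older_not_ny)
print(in_cities)
print(young_names)
```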

5. How do you handle missing values in data with the help of pandas?
Handling missing values is a crucial step in the data cleaning process. Pandas provides
several methods for working with missing data in a DataFrame. Here are some common
techniques:

1. Detecting Missing Values:


The isnull() method can be used to detect missing values in a DataFrame. It
returns a DataFrame of the same shape, where each element is a boolean
indicating whether the corresponding element in the original DataFrame is missing.
The notnull() method is the opposite of isnull() and returns True for
non-missing values.

In [ ]: import pandas as pd

# Assuming 'df' is your DataFrame
missing_values = df.isnull()
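The boolean DataFrame returned by isnull() is usually summarized rather than read directly. This is a minimal sketch with a small made-up DataFrame, showing a per-column count and the rows that contain at least one missing value.

```python
import pandas as pd

# Hypothetical data with two missing entries (None becomes a missing value).
df = pd.DataFrame({
    "Name": ["John", "Jane", None],
    "Age": [28, None, 22],
})

# True counts as 1 when summed, so this gives missing values per column.
missing_per_column = df.isnull().sum()

# Rows where any column is missing.
rows_with_missing = df[df.isnull().any(axis=1)]

print(missing_per_column)
print(rows_with_missing)
```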


2. Dropping Missing Values:


The dropna() method can be used to remove rows or columns containing
missing values.
The thresh parameter can be used to specify a threshold for the number of non-
null values required to keep a row or column.

In [ ]: # Drop rows with any missing values
df_no_missing_rows = df.dropna()

# Drop columns with any missing values
df_no_missing_cols = df.dropna(axis=1)

# Keep only rows with at least 3 non-null values (rows with fewer are dropped)
df_thresh = df.dropna(thresh=3)
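To make the effect of each dropna() variant concrete, the sketch below applies them to a small made-up DataFrame and checks which row labels survive. The `subset` parameter, which restricts the check to chosen columns, is included as a commonly used addition.

```python
import pandas as pd

# Hypothetical data: row 0 is complete, rows 1 and 3 have two values, row 2 has none.
df = pd.DataFrame({
    "Name": ["John", "Jane", None, "Alice"],
    "Age": [28, None, None, 30],
    "City": ["New York", "San Francisco", None, None],
})

any_missing_dropped = df.dropna()          # keep only fully populated rows
at_least_two = df.dropna(thresh=2)         # keep rows with 2 or more non-null values
age_required = df.dropna(subset=["Age"])   # only the Age column is checked

print(any_missing_dropped.index.tolist())
print(at_least_two.index.tolist())
print(age_required.index.tolist())
```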

3. Filling Missing Values:

The fillna() method can be used to fill missing values with a specified constant
or with per-column values such as the mean.
Forward fill and backward fill are available through the ffill() and bfill() methods.
Commonly, mean or median values are used to fill missing values in numerical
columns.

In [ ]: # Fill missing values with a constant
df_fill_constant = df.fillna(0)

# Fill missing values in numeric columns with each column's mean
# (numeric_only=True skips text columns such as 'Name' and 'City')
df_fill_mean = df.fillna(df.mean(numeric_only=True))

# Forward fill missing values (use the previous value)
# fillna(method='ffill') is deprecated in recent pandas versions
df_ffill = df.ffill()

# Backward fill missing values (use the next value)
df_bfill = df.bfill()
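Different columns often need different fill values. One way to do this, sketched here with made-up data, is to pass fillna() a dictionary that maps column names to fill values: the median for a numeric column and a placeholder string for a text column. The placeholder "Unknown" is an assumption, not a pandas convention.

```python
import pandas as pd

# Hypothetical data with one missing value in each column.
df = pd.DataFrame({
    "Age": [28, None, 22, 30],
    "City": ["New York", None, "Los Angeles", "Chicago"],
})

# Map each column to its own fill value.
fill_values = {"Age": df["Age"].median(), "City": "Unknown"}
filled = df.fillna(fill_values)

print(filled)
```

fillna() returns a new DataFrame here, so the original `df` still contains its missing values.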
