✅ 1. What is Pandas in Python?
Pandas is an open-source Python library used for data analysis and manipulation. It
provides easy-to-use data structures like Series (1D) and DataFrame (2D), making it simple
to work with structured (tabular) data.
✅ 2. What are the key data structures in Pandas?
Series: One-dimensional labeled array (like a single column).
DataFrame: Two-dimensional table with labeled rows and columns (like a
spreadsheet or SQL table).
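A minimal sketch of the two structures, using made-up inventory data (the names prices and inventory are illustrative only):

```python
import pandas as pd

# A Series: one-dimensional values, each with an index label.
prices = pd.Series([1.5, 2.0, 3.25], index=["apple", "banana", "cherry"], name="price")

# A DataFrame: a table whose columns share one row index.
inventory = pd.DataFrame(
    {"price": [1.5, 2.0, 3.25], "stock": [10, 0, 7]},
    index=["apple", "banana", "cherry"],
)

# Selecting a single column of a DataFrame gives back a Series.
price_column = inventory["price"]
print(type(price_column).__name__)  # Series
print(inventory.shape)              # (3, 2)
```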
✅ 3. How is Pandas different from NumPy?
NumPy provides fast, homogeneous n-dimensional arrays for numerical computing.
Pandas builds on NumPy and adds row and column labels, columns of different types, and
easier handling of tabular (structured) data.
✅ 4. What is the role of Pandas in data engineering?
Pandas is widely used in the data cleaning, exploration, and transformation stages of ETL
pipelines. It helps data engineers prepare raw data before loading it into data warehouses
or big data systems.
✅ 5. What are the advantages of using Pandas?
Easy handling of missing data
Powerful data filtering and transformation
Built-in support for time-series data
Fast operations with NumPy backend
Ability to read/write many file formats (CSV, JSON, Excel, SQL)
✅ 6. What is data alignment in Pandas?
Data alignment refers to how Pandas automatically matches data by labels (index/column
names), not by position, when performing operations between Series or DataFrames. Labels
present on only one side produce NaN in the result unless a fill value is supplied.
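A small sketch of alignment with two illustrative Series whose labels only partly overlap:

```python
import pandas as pd

a = pd.Series([1, 2, 3], index=["x", "y", "z"])
b = pd.Series([10, 20, 30], index=["y", "z", "w"])

# Values are matched by label, not by position.
# Labels found on only one side ("x" and "w") become NaN.
total = a + b

# add() with fill_value treats a missing side as 0 instead of NaN.
with_fill = a.add(b, fill_value=0)

print(total["y"], total["z"])          # 12.0 23.0
print(with_fill["x"], with_fill["w"])  # 1.0 30.0
```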
✅ 7. Explain Pandas indexing.
Pandas allows label-based indexing using .loc[] and position-based indexing using .iloc[].
Indexing helps select and manipulate rows and columns efficiently.
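A short sketch contrasting the two indexers on an illustrative DataFrame. Note that label slices include the end label, while position slices exclude the end position:

```python
import pandas as pd

df = pd.DataFrame(
    {"name": ["Ann", "Bob", "Cy"], "age": [31, 25, 40]},
    index=["a", "b", "c"],
)

by_label = df.loc["b", "age"]     # row label "b", column label "age"
by_position = df.iloc[1, 1]       # second row, second column

label_slice = df.loc["a":"b"]     # end label included: rows "a" and "b"
position_slice = df.iloc[0:1]     # end position excluded: first row only

# .loc also accepts a boolean mask for the rows.
over_30 = df.loc[df["age"] > 30, "name"]

print(by_label, by_position)                   # 25 25
print(len(label_slice), len(position_slice))   # 2 1
print(over_30.tolist())                        # ['Ann', 'Cy']
```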
✅ 8. What is broadcasting in Pandas?
Broadcasting is the process where a smaller array (like a single value or row) is
automatically expanded to match the shape of a larger DataFrame during operations. It
simplifies operations like adding a constant to all rows.
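A minimal sketch of broadcasting a scalar and a row across an illustrative DataFrame. When the smaller operand is a Series, Pandas first aligns it with the column labels and then applies it to every row:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# A scalar is expanded to every cell.
plus_one = df + 1

# A Series is matched to the column labels and applied to each row.
row = pd.Series({"a": 100, "b": 1000})
shifted = df + row

# Common use: subtract each column's mean from that column.
centered = df - df.mean()

print(plus_one["a"].tolist())   # [2, 3, 4]
print(shifted["b"].tolist())    # [1010, 1020, 1030]
print(centered["a"].tolist())   # [-1.0, 0.0, 1.0]
```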
✅ 9. What is vectorization in Pandas?
Vectorization allows you to apply operations directly to entire Series or DataFrames
without writing explicit loops. It is faster and more efficient than looping through rows.
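A sketch comparing an explicit row loop with the vectorized equivalent, on illustrative price and quantity columns. Both give the same numbers; the vectorized form is a single expression over whole columns:

```python
import pandas as pd

df = pd.DataFrame({"price": [2.0, 3.5, 1.25], "qty": [3, 2, 8]})

# Loop version: visits one row at a time in Python.
totals_loop = []
for _, row in df.iterrows():
    totals_loop.append(row["price"] * row["qty"])

# Vectorized version: multiplies the two columns in one step.
df["total"] = df["price"] * df["qty"]

print(df["total"].tolist())  # [6.0, 7.0, 10.0]
```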
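✅ 10. What is chained indexing and why should it be avoided in Pandas?
Chained indexing means applying two indexing operations back to back (e.g.,
df[df['col'] > 0]['other_col']). The first step may return a copy, so an assignment through
the chain can silently fail to update the original DataFrame and typically triggers a
SettingWithCopyWarning.
Use a single .loc[] or .iloc[] call for clearer and safer indexing. This is not the same as
method chaining (calling one method after another), which is fine.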
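A minimal sketch of the safer form, using the column names from the example above. The chained assignment is left as a comment because its effect depends on the pandas version and settings:

```python
import pandas as pd

df = pd.DataFrame({"col": [-1, 2, 3], "other_col": [0, 0, 0]})

# Chained form (avoid): the filter may produce a temporary copy,
# so the assignment might never reach df.
# df[df["col"] > 0]["other_col"] = 1

# Single .loc call: row mask and column label in one indexing step.
df.loc[df["col"] > 0, "other_col"] = 1

print(df["other_col"].tolist())  # [0, 1, 1]
```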
✅ 11. How does Pandas handle missing data?
Pandas provides tools like:
isnull() and notnull() to detect nulls
dropna() to remove missing data
fillna() to fill missing values with a specified value
ffill() and bfill() to propagate the previous or next valid value
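A short sketch of these tools on an illustrative DataFrame with one missing value per column:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"city": ["Oslo", None, "Rome"], "temp": [4.0, np.nan, 18.0]})

# Detect: count missing values in each column.
missing_per_column = df.isnull().sum()

# Remove: keep only rows with no missing values.
complete_rows = df.dropna()

# Fill: a different replacement for each column.
filled = df.fillna({"city": "unknown", "temp": df["temp"].mean()})

# Fill by carrying the previous valid value forward.
carried = df["temp"].ffill()

print(missing_per_column.tolist())  # [1, 1]
print(len(complete_rows))           # 2
print(filled["temp"].tolist())      # [4.0, 11.0, 18.0]
print(carried.tolist())             # [4.0, 4.0, 18.0]
```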
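✅ 12. What is the difference between shallow copy and deep copy in Pandas?
Plain assignment (df2 = df) is not a copy at all: both names refer to the same object, so
changes made through one are visible through the other.
Shallow copy (df2 = df.copy(deep=False)) creates a new object that shares the underlying
data with the original. Whether changes propagate depends on the pandas version and its
Copy-on-Write setting.
Deep copy (df2 = df.copy(), deep=True by default) creates an independent copy. Changes in
one don’t affect the other.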
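A small sketch of how plain assignment, df.copy(deep=False), and df.copy() differ. It does not check whether a write propagates through the shallow copy, since that depends on the pandas version and Copy-on-Write settings:

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3]})

alias = df                      # same object under a second name
shallow = df.copy(deep=False)   # new object sharing the data
deep = df.copy()                # new object with its own data

# Writing through the alias changes df, because they are one object.
alias.loc[0, "x"] = 99

print(alias is df)       # True
print(shallow is df)     # False
print(df.loc[0, "x"])    # 99
print(deep.loc[0, "x"])  # 1
```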
✅ 13. What is the use of groupby() in Pandas?
groupby() is used to split data into groups, apply functions (like sum, mean), and combine
the results. It’s helpful for summarizing and analyzing data by category.
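A minimal split-apply-combine sketch on illustrative sales data (the column names region and amount are made up):

```python
import pandas as pd

sales = pd.DataFrame({
    "region": ["north", "south", "north", "south", "east"],
    "amount": [100, 50, 25, 75, 40],
})

# Split by region, sum each group, combine into one Series.
totals = sales.groupby("region")["amount"].sum()

# Several aggregations at once give one column per function.
summary = sales.groupby("region")["amount"].agg(["sum", "mean", "count"])

print(totals["north"])               # 125
print(summary.loc["north", "mean"])  # 62.5
```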
✅ 14. What are categorical variables in Pandas?
Categorical variables are columns that contain a fixed number of distinct values (like 'Male',
'Female', 'Yes', 'No'). Pandas provides the category dtype, which stores each distinct value
once and represents the column as integer codes, saving memory and speeding up some
operations.
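A sketch of converting a repetitive text column to the category dtype. The data is illustrative and the exact byte counts vary by pandas version, so only the comparison is printed:

```python
import pandas as pd

answers = pd.Series(["Yes", "No", "Yes", "Yes", "No"] * 1000)
as_category = answers.astype("category")

# Distinct values are stored once; each row holds a small integer code.
categories = list(as_category.cat.categories)
first_codes = as_category.cat.codes.head(5).tolist()

text_bytes = answers.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print(categories)                   # ['No', 'Yes']
print(first_codes)                  # [1, 0, 1, 1, 0]
print(category_bytes < text_bytes)  # True
```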
✅ 15. How does Pandas improve performance with large datasets?
Uses NumPy under the hood for fast computations
Offers vectorized operations
Supports categorical data types to reduce memory
Can chunk data for processing large files
Can be combined with Dask or Modin for scaling to bigger data
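A sketch of chunked processing with read_csv and its chunksize parameter. An in-memory string stands in for a large file here; with a real file you would pass its path instead:

```python
import io
import pandas as pd

# Ten data rows: id 0..9 with value = id * 2.
csv_text = "id,value\n" + "\n".join(f"{i},{i * 2}" for i in range(10))

total = 0
n_chunks = 0
# Each chunk is a small DataFrame, so only 4 rows are held at a time.
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4):
    total += int(chunk["value"].sum())
    n_chunks += 1

print(total)     # 90
print(n_chunks)  # 3
```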
✅ 16. What are common file formats supported by Pandas?
CSV (read_csv)
Excel (read_excel)
JSON (read_json)
SQL (read_sql)
Parquet (read_parquet)
HDF5 (read_hdf), clipboard (read_clipboard), and more
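A minimal round-trip sketch for two of these formats. It uses in-memory strings rather than files so it runs without touching disk; Excel, Parquet, and HDF5 need optional dependencies and are left out:

```python
import io
import pandas as pd

df = pd.DataFrame({"id": [1, 2], "name": ["a", "b"]})

# CSV: to_csv returns a string when no path is given.
csv_text = df.to_csv(index=False)
from_csv = pd.read_csv(io.StringIO(csv_text))

# JSON: one object per row with orient="records".
json_text = df.to_json(orient="records")
from_json = pd.read_json(io.StringIO(json_text))

print(from_csv["id"].tolist())     # [1, 2]
print(from_json["name"].tolist())  # ['a', 'b']
```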
✅ 17. What is time-series data in Pandas and how is it handled?
Time-series data is data indexed by date or time.
Pandas has strong support for:
Datetime indexing
Resampling
Rolling windows
Time-based filtering
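A short sketch of these four features on an illustrative daily series of six values:

```python
import pandas as pd

# Datetime indexing: six consecutive days starting 2024-01-01.
idx = pd.date_range("2024-01-01", periods=6, freq="D")
ts = pd.Series([1, 2, 3, 4, 5, 6], index=idx)

# Resampling: sum the values in 2-day bins.
two_day_sum = ts.resample("2D").sum()

# Rolling window: mean over the current and two previous days.
rolling_mean = ts.rolling(window=3).mean()

# Time-based filtering: date strings slice the index, both ends included.
jan_3_to_4 = ts.loc["2024-01-03":"2024-01-04"]

print(two_day_sum.tolist())            # [3, 7, 11]
print(rolling_mean.dropna().tolist())  # [2.0, 3.0, 4.0, 5.0]
print(jan_3_to_4.tolist())             # [3, 4]
```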
✅ 18. Can Pandas work with big data?
Pandas works best with small to medium-sized data that fits in memory.
For big data:
Use Dask, Vaex, or Modin
Or move to Spark DataFrames
✅ 19. What is the difference between apply(), map(), and applymap()?
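map() → applies a function, dict, or Series to each element of a Series (single column)
apply() → works on a Series (element-wise) or a DataFrame (along rows or columns, chosen
with axis)
applymap() → applies a function to each element of a DataFrame (cell-wise); deprecated since
pandas 2.1 in favor of DataFrame.map()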
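A sketch of the three on an illustrative DataFrame. Because cell-wise application is named DataFrame.map in pandas 2.1+ and applymap in older versions, the example picks whichever the installed version provides:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# Series.map: element-wise lookup or function on one column.
labels = df["a"].map({1: "one", 2: "two", 3: "three"})

# DataFrame.apply: the function receives a whole column (default axis=0)...
col_ranges = df.apply(lambda col: col.max() - col.min())

# ...or a whole row with axis=1.
row_sums = df.apply(lambda row: row["a"] + row["b"], axis=1)

# Cell-wise: DataFrame.map if present, else the older applymap.
cellwise = pd.DataFrame.map if hasattr(pd.DataFrame, "map") else pd.DataFrame.applymap
squared = cellwise(df, lambda v: v * v)

print(labels.tolist())      # ['one', 'two', 'three']
print(col_ranges.tolist())  # [2, 20]
print(row_sums.tolist())    # [11, 22, 33]
print(squared["b"].tolist())  # [100, 400, 900]
```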