Pandas – Detailed Notes
Introduction to Pandas
• Pandas is a fast, powerful, flexible, and easy-to-use open-source data analysis and data
manipulation Python library.
• Built on top of NumPy.
Main Data Structures
1. Series – 1D labeled array capable of holding any data type (integers, strings, floating point
numbers, Python objects, etc.).
2. DataFrame – 2D labeled data structure with columns of potentially different types (like a
spreadsheet or SQL table).
3. Index – Immutable sequence used for indexing and aligning data.
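A minimal sketch of the three structures above. The fruit labels, column names, and values are made up for illustration.

```python
import pandas as pd

# Series: one-dimensional values, each with a label.
prices = pd.Series([1.5, 2.0, 3.25], index=["apple", "banana", "cherry"], name="price")

# DataFrame: columns of different types that share one row index.
inventory = pd.DataFrame(
    {"price": [1.5, 2.0, 3.25], "in_stock": [True, False, True]},
    index=["apple", "banana", "cherry"],
)

# Selecting a single column of a DataFrame gives back a Series.
price_column = inventory["price"]

# Index: immutable, so assigning to one of its elements raises TypeError.
try:
    inventory.index[0] = "apricot"
    index_is_mutable = True
except TypeError:
    index_is_mutable = False
```

To relabel rows, assign a whole new index (for example with rename() or set_index()) instead of changing one element in place.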
Key Features
• Handling missing data.
• Size mutability – columns can be inserted into and deleted from a DataFrame.
• Automatic and explicit data alignment.
• Powerful group-by functionality for performing split-apply-combine operations.
• High-performance merging and joining of data sets.
• Time series functionality.
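A few of these features can be sketched on small, made-up data. The labels, the column names, and the price of 10.0 per unit are assumptions for illustration only.

```python
import pandas as pd

# Automatic alignment: arithmetic matches values by label, not by position.
a = pd.Series([1, 2, 3], index=["x", "y", "z"])
b = pd.Series([10, 20, 30], index=["y", "z", "w"])
total = a + b  # labels found in only one operand produce NaN

# Size mutability: add a column, then remove it again.
sales = pd.DataFrame(
    {"region": ["east", "west", "east", "west"], "units": [3, 5, 2, 4]}
)
sales["revenue"] = sales["units"] * 10.0
sales = sales.drop(columns=["revenue"])

# Split-apply-combine: split by region, sum each group, combine the results.
units_by_region = sales.groupby("region")["units"].sum()
```

The NaN values in total are also a small instance of missing data, which dropna() or fillna() can then handle.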
Commonly Used Functions
• Data Loading: read_csv(), read_excel(), read_sql(), read_json(), read_html().
• Data Export: to_csv(), to_excel(), to_sql(), to_json().
• Data Selection: loc[], iloc[], at[], iat[].
• Data Cleaning: dropna(), fillna(), replace().
• Data Transformation: apply(), map(), astype().
• Grouping: groupby(), aggregate(), transform().
• Combining Data: merge(), join(), concat().
• Reshaping: pivot_table(), stack(), unstack(), melt().
• Descriptive Statistics: describe(), mean(), median(), std(), value_counts().
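Several of the functions listed above can be chained in one short pipeline. This is a sketch under assumptions: the CSV text, the column names, and the customer table are invented, and io.StringIO stands in for a real file path.

```python
import io

import pandas as pd

# Data loading: read_csv() accepts a path or any file-like object.
csv_text = "order_id,customer,amount\n1,ann,10.5\n2,bob,20.0\n3,ann,4.5\n"
orders = pd.read_csv(io.StringIO(csv_text))

# Combining data: a left merge keeps every order and adds the customer's city.
customers = pd.DataFrame({"customer": ["ann", "bob"], "city": ["Oslo", "Lima"]})
enriched = orders.merge(customers, on="customer", how="left")

# Reshaping: total amount per city.
totals = enriched.pivot_table(index="city", values="amount", aggfunc="sum")

# Descriptive statistics: number of orders per customer.
order_counts = enriched["customer"].value_counts()

# Data export: to_csv() returns the CSV text when no path is given.
exported = totals.to_csv()
```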
Indexing and Selection
• Label-based selection with loc[].
• Integer location-based selection with iloc[].
• Boolean indexing for filtering data.
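The three selection styles can be compared side by side on a small, made-up table (the names, ages, and cities are illustrative):

```python
import pandas as pd

df = pd.DataFrame(
    {"age": [34, 27, 45], "city": ["Oslo", "Lima", "Pune"]},
    index=["ann", "bob", "cy"],
)

ann_age = df.loc["ann", "age"]      # label-based: row label, column label
first_row = df.iloc[0]              # position-based: first row as a Series
label_slice = df.loc["ann":"bob"]   # label slices include both endpoints
position_slice = df.iloc[0:1]       # position slices exclude the end
over_30 = df[df["age"] > 30]        # Boolean mask keeps rows where it is True
bob_city = df.at["bob", "city"]     # at[] is a fast scalar lookup by label
```

Note the slicing difference: loc["ann":"bob"] returns two rows, while iloc[0:1] returns one.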
Handling Missing Data
• dropna(): Drop missing values.
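• fillna(): Fill missing values with a specified value. For forward or backward filling, recent pandas versions favor ffill() and bfill() over the method argument of fillna().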
• interpolate(): Fill missing values using interpolation.
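A small sketch of these options on an invented Series of readings. ffill() is included as the forward-fill counterpart to filling with a constant.

```python
import numpy as np
import pandas as pd

readings = pd.Series([1.0, np.nan, 3.0, np.nan, 7.0])

dropped = readings.dropna()            # keeps only the three valid values
filled_zero = readings.fillna(0.0)     # replaces each NaN with a constant
forward_filled = readings.ffill()      # carries the last valid value forward
interpolated = readings.interpolate()  # linear interpolation by default
```

Here interpolate() fills the gaps with 2.0 and 5.0, the midpoints of the neighbouring values.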
Time Series and Date Functionality
• Pandas has built-in support for working with time series data, including date_range(),
to_datetime(), resample(), shifting with shift(), and rolling windows with rolling().
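The functions named above, sketched on six days of made-up visit counts (the dates and numbers are assumptions for illustration):

```python
import pandas as pd

# date_range(): six consecutive daily timestamps.
days = pd.date_range("2024-01-01", periods=6, freq="D")
visits = pd.Series([10, 12, 9, 15, 11, 14], index=days)

# to_datetime(): parse a string into a Timestamp usable as a label.
third_day = pd.to_datetime("2024-01-03")
visits_on_third_day = visits.loc[third_day]

# resample(): regroup into two-day bins and sum each bin.
two_day_totals = visits.resample("2D").sum()

# shift(): move values down one step, so each row sees the previous day's value.
previous_day = visits.shift(1)

# rolling(): three-day moving average (NaN until the window is full).
rolling_mean = visits.rolling(window=3).mean()
```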
Visualization
• plot() (for example, DataFrame.plot()) uses matplotlib internally to plot data directly from pandas.
Integration with Other Libraries
• Works seamlessly with NumPy, Matplotlib, SQLAlchemy, openpyxl, PyArrow, and more.
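As one illustration of this integration, the sketch below moves data between pandas and NumPy. The column names and values are invented, and the other listed libraries are not shown.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": [1.0, 4.0, 9.0], "y": [2.0, 3.0, 4.0]})

# NumPy functions applied to a Series return a Series with the same index.
roots = np.sqrt(df["x"])

# to_numpy() exposes the values as a plain 2D ndarray.
matrix = df.to_numpy()

# An ndarray can be wrapped back into a DataFrame with column labels.
rebuilt = pd.DataFrame(matrix, columns=df.columns)
```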
Summary
Pandas is the go-to library in Python for data manipulation, cleaning, analysis, and preparation. It
integrates with numerous libraries for visualization and file handling, making it an essential tool for
data scientists and analysts.