Pandas Notes - Data Handling & Analysis
Reading Data
- Use `pd.read_csv('file.csv')` to read CSV files.
- Use `pd.read_excel('file.xlsx')` to read Excel files.
- Use `df.head()` to view the first 5 rows.
- Use `df.tail()` to view the last 5 rows.
- Use `df.info()` to see data types and non-null counts.
Example:
df = pd.read_csv('data.csv')
Cleaning Data
- Use `df.isnull().sum()` to check missing values.
- Fill missing data: `df.fillna(value)`.
- Drop missing rows: `df.dropna()`.
- Rename columns: `df.rename(columns={'old':'new'})`.
Example:
df['age'].fillna(df['age'].mean(), inplace=True)
Filtering Data
- Single condition: `df[df['age'] > 30]`
- Multiple conditions: `df[(df['age'] > 25) & (df['marks'] > 80)]`
- Equality: `df[df['name'] == 'Bob']`
- `isin()`: `df[df['name'].isin(['Alice', 'David'])]`
- String match: `df[df['name'].str.startswith('A')]`
Example:
df.query('age > 30 and marks < 90')
Analyzing Data
- `df.describe()` gives statistical summary.
- Column stats: `mean()`, `max()`, `min()`, `mode()`.
- `value_counts()` for frequency count.
- `df.groupby('col')['val'].mean()` for grouped mean.
- `df.corr()` for correlation.
Example:
df.groupby('department')['marks'].agg(['min', 'max', 'mean'])
Grouping Data
- Use `groupby()` to group and aggregate.
- Average marks: `df.groupby('department')['marks'].mean()`
- Multiple stats: `agg(['min', 'max'])`
- Group by multiple: `df.groupby(['dept', 'name'])`
Pandas Notes - Data Handling & Analysis
- Reset index: `reset_index()` to flatten result.
Example:
df.groupby('department')['marks'].sum().reset_index()
Merging Data
- `pd.merge(df1, df2, on='id')` for inner join.
- `how='left'`, `'right'`, `'outer'` for other joins.
- Merge on multiple keys: `on=['id', 'name']`.
Example:
pd.merge(students, marks, on='id', how='left')
Exporting Data
- To CSV: `df.to_csv('file.csv', index=False)`
- To Excel: `df.to_excel('file.xlsx', index=False)`
- To JSON: `df.to_json('file.json')`
Example:
df.to_csv('cleaned_data.csv', index=False)