EDA Course Notes

Uploaded by

jasmineamjadi

### Exploratory Data Analysis (EDA) Overview

- **Purpose**: EDA is used to review and clean data, compute descriptive statistics and correlations, derive insights, generate hypotheses, and guide the next steps (e.g., preparing data for modeling or discarding unusable data).
- **Data Import & Overview**:
  - Import data with `pd.read_csv()` and preview the first rows with `.head()`.
  - Check column info (e.g., missing values, data types) using `.info()`.
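
As a minimal sketch of the import-and-overview step, the snippet below reads a tiny CSV from an in-memory string instead of a file on disk. The country names and values are invented for illustration and are not from the course dataset.

```python
import io

import pandas as pd

# Hypothetical stand-in for a CSV file; every value here is made up.
csv_text = """country_code,country_name,continent,2020,2021
AAA,Aland,Europe,5.1,4.8
BBB,Borduria,Europe,7.3,
CCC,Carpania,Asia,3.9,4.2
"""

# pd.read_csv accepts a file path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))

print(df.head())  # first rows (up to five)
df.info()         # prints dtypes and non-null counts; returns None
```

The empty `2021` cell for the second row is read as a missing value, which is why `.info()` reports one fewer non-null entry for that column.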

### Key Methods for Initial Exploration


1. **Count Categories**: Use `.value_counts()` for categorical columns.
2. **Summary Statistics**: Use `.describe()` for the count, mean, standard deviation, min, max, and quartiles of numerical columns.
3. **Histograms**:
   - Use Seaborn (`sns.histplot()`) to visualize distributions.
   - Adjust the bin width with the `binwidth` argument to show more or less detail.

### Exercise: Initial Exploration of Unemployment Data


- Use:
```python
print(unemployment.head())
unemployment.info()  # .info() prints its own report and returns None
print(unemployment.describe())
```
- Key Insights: The data covers 182 countries, with columns `country_code`, `country_name`, `continent`, and yearly unemployment rates for 2010-2021.

### Validating Data Types


- **Detect Data Type Issues**:
  - Use `.dtypes` to list column data types.
  - Convert types as needed, e.g., `unemployment["2019"] = unemployment["2019"].astype(float)`.
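
A small sketch of the detect-then-convert step, assuming a numeric column that was read in as text. The data is invented for illustration.

```python
import pandas as pd

# Made-up data: the "2019" column holds numbers stored as strings.
df = pd.DataFrame({
    "country_name": ["Aland", "Borduria"],
    "2019": ["5.1", "7.3"],
})

print(df.dtypes)  # "2019" shows up as a text dtype, not a numeric one

# Convert the column to floats so numeric methods work on it.
df["2019"] = df["2019"].astype(float)
print(df.dtypes)  # "2019" is now float64
```

Note that `.astype(float)` raises an error if any value cannot be parsed as a number, which is itself a useful signal that the column needs cleaning first.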

- **Filter & Validate**:
  - Use `.isin()` to build a Boolean mask for filtering; negate the mask with `~` to exclude values, e.g., drop rows where `continent` is "Oceania".
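
One way the exclusion could be written, shown on invented rows rather than the course dataset:

```python
import pandas as pd

# Made-up data for illustration only.
df = pd.DataFrame({
    "country_name": ["Aland", "Borduria", "Carpania"],
    "continent": ["Europe", "Oceania", "Asia"],
})

# .isin() marks rows whose continent is in the list; ~ flips the mask.
not_oceania = ~df["continent"].isin(["Oceania"])

# Boolean indexing keeps only the rows where the mask is True.
filtered = df[not_oceania]
print(filtered)
```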

### Data Range Validation


- **Boxplots for Range**: Create boxplots to see distributions, min/max, and
quartiles, e.g., `sns.boxplot(data=unemployment, x='2021', y='continent')`.
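
As a rough sketch, a range check can pair the numeric minimum and maximum with the boxplot. The values below are made up, and the `Agg` backend is chosen only so the code runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed in a notebook

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up data for illustration only.
df = pd.DataFrame({
    "continent": ["Asia", "Asia", "Europe", "Europe", "Africa", "Africa"],
    "2021": [4.2, 3.3, 6.1, 7.8, 12.5, 9.0],
})

# Numeric range of the column.
low, high = df["2021"].min(), df["2021"].max()
print(low, high)

# One horizontal box per continent.
ax = sns.boxplot(data=df, x="2021", y="continent")
plt.close()  # in an interactive session, call plt.show() instead
```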

### Aggregation & Grouping with `.groupby()` and `.agg()`


- **Grouped Summary**:
  - Calculate the mean and standard deviation within each category using `.groupby()` and `.agg()`.
  - Use named aggregations to give the result columns descriptive names:
```python
continent_summary = unemployment.groupby("continent").agg(
    mean_rate_2021=('2021', 'mean'),
    std_rate_2021=('2021', 'std'),
)
```
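
To see what named aggregation changes, the sketch below runs both the plain list form and the named form on a small made-up DataFrame. The values are chosen only to keep the arithmetic easy to follow.

```python
import pandas as pd

# Made-up data for illustration only.
df = pd.DataFrame({
    "continent": ["Asia", "Asia", "Europe", "Europe"],
    "2021": [4.0, 6.0, 5.0, 9.0],
})

# List form: result columns are named after the functions ("mean", "std").
basic = df.groupby("continent")["2021"].agg(["mean", "std"])
print(basic)

# Named form: each keyword becomes a column, given as (source column, function).
named = df.groupby("continent").agg(
    mean_rate_2021=("2021", "mean"),
    std_rate_2021=("2021", "std"),
)
print(named)
```

Both forms compute the same numbers; the named form simply labels the output columns so they remain meaningful after the year column is no longer visible.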

### Visualizing Categorical Summaries


- **Bar Plots with Confidence Intervals**:
  - `sns.barplot()` draws the mean of each category with an error bar that shows a 95% confidence interval by default, e.g.,
```python
sns.barplot(data=unemployment, x='continent', y='2021')
```
