Introduction To Seaborn

This document provides a comprehensive introduction to Seaborn, a Python data visualization library, and outlines a data cleaning process essential for effective data analysis. It details various plotting techniques, including distribution, categorical, and matrix plots, along with code examples for each type. Additionally, it covers advanced customizations and settings for enhancing visualizations.

Uploaded by

anas jan

Introduction to Seaborn

Seaborn is a high‑level Python data visualization library built on top of Matplotlib.
It integrates tightly with Pandas DataFrames and makes it easier to create visually
appealing statistical plots. Key features include:

Built‑in Themes: Easily switch between themes such as "whitegrid", "darkgrid",
or "ticks".
Context Settings: Adjust the scale of plot elements with contexts like
"paper", "notebook", "talk", and "poster".
Color Palettes: Use pre‑defined palettes (e.g., "deep", "muted", "pastel",
"bright", "dark", "colorblind") or create custom palettes.

Part 1. Data Cleaning


In real‑world data analysis, cleaning your data is one of the most important steps. In
our simulated dataset, we deliberately introduced duplicates, missing values, and
inconsistent formatting to mimic real‐life challenges. Below is a step‑by‑step data
cleaning plan with code examples and explanations.
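The simulated dataset itself is not shown in this lecture. The sketch below is one guess at what such a DataFrame might look like, so that the cleaning steps have something to run against. The column names match those used in the steps that follow, but every value, the row count, and the helper name make_messy_df are assumptions made purely for illustration.

```python
import numpy as np
import pandas as pd


def make_messy_df():
    """Build a small, hypothetical employee table with deliberate flaws."""
    rows = [
        # Age, Experience, Country, Gender, Department, Salary, Feedback, Rating, JoinDate
        (34, 10, "USA", "M", "sales", "$55,000", "satisfied", 4.0, "2019-03-15"),
        (28, 4, "India", "F", "SALES", "$48,500", "Neutral", 3.5, "2020-07-01"),
        (np.nan, 7, None, "Male", "Engineering", "$72,000", "Satisfied", 4.5, "2018-11-20"),
        (45, np.nan, "UK", "Female", "hr", "$61,250", "dissatisfied", 2.0, "not recorded"),
        # Exact copy of the first row, standing in for an accidental duplicate
        (34, 10, "USA", "M", "sales", "$55,000", "satisfied", 4.0, "2019-03-15"),
    ]
    columns = ["Age", "Experience", "Country", "Gender", "Department",
               "Salary", "Feedback", "Rating", "JoinDate"]
    return pd.DataFrame(rows, columns=columns)


df = make_messy_df()
print(df)
```

Each flaw in this sample (a repeated row, missing Age, Experience, and Country values, mixed-case text, currency strings, and an unparseable date) corresponds to one of the cleaning steps.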

1. Inspect the Data


Before cleaning, understand the data structure and quality. Use these commands:

# Check data types, non-null counts, and summary statistics
df.info()  # prints its report directly and returns None, so no print() is needed
print(df.describe())
print(df.head(10))

2. Remove Duplicates
Since we intentionally added duplicates, remove them to ensure each record is unique:

# Identify the number of duplicated rows
print("Duplicated rows:", df.duplicated().sum())

# Remove duplicates in-place
df.drop_duplicates(inplace=True)
print("Dataset shape after dropping duplicates:", df.shape)

Explanation:

df.duplicated() : Returns a Boolean Series that marks each row repeating an earlier row.
df.drop_duplicates(inplace=True) : Removes those repeated rows, keeping the first occurrence of each.
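To see what the two calls return, here is a small sketch on a hypothetical three-row sample (the names and values are invented for illustration and are not part of the lecture's dataset):

```python
import pandas as pd

# Hypothetical three-row sample; the last row repeats the first.
sample = pd.DataFrame({
    "Name": ["Ana", "Ben", "Ana"],
    "Department": ["Sales", "HR", "Sales"],
})

mask = sample.duplicated()          # flags every repeat after the first occurrence
print(mask.tolist())

deduped = sample.drop_duplicates()  # keeps the first occurrence by default (keep="first")
print(deduped.shape)
```

This version returns a new DataFrame instead of using inplace=True, which leaves the original sample available for comparison.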

3. Handle Missing Values


Identify columns with missing values and decide how to handle them (imputation or
replacement).

# Check missing values per column
print(df.isnull().sum())

# Impute numerical columns with the median and categorical ones with the placeholder "Unknown"
df['Age'] = df['Age'].fillna(df['Age'].median())
df['Experience'] = df['Experience'].fillna(df['Experience'].median())
df['Country'] = df['Country'].fillna("Unknown")

Explanation:

For columns like Age and Experience, the median is a good imputation choice
because it is less sensitive to outliers.
For categorical variables (like Country), we use a default placeholder.
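The claim about outliers can be checked on a handful of numbers. The ages below are hypothetical and chosen so that one extreme value distorts the mean:

```python
import numpy as np
import pandas as pd

# Hypothetical ages: one extreme value (120) and one missing entry.
ages = pd.Series([25, 30, 35, 40, 120, np.nan])

mean_age = ages.mean()      # pulled upward by the outlier
median_age = ages.median()  # middle value of the observed ages

filled = ages.fillna(median_age)
print(mean_age, median_age)
print(filled.tolist())
```

Here the mean is 50 while the median is 35, so filling the gap with the median gives a value closer to the typical observation.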

4. Correct Inconsistent Data


Standardize text entries and remove unwanted characters.

# Standardize Gender entries
df['Gender'] = df['Gender'].replace({"M": "Male", "F": "Female"})

# Standardize Department: convert all entries to title case
df['Department'] = df['Department'].str.title()

# Clean Salary: remove "$" signs and thousands separators, then convert to numeric
# (raw string so that "\$" is not treated as an invalid Python escape sequence)
df['Salary'] = df['Salary'].replace(r'[\$,]', '', regex=True).astype(float)

# Standardize Feedback to title case (e.g., converting lower-case "satisfied" to "Satisfied")
df['Feedback'] = df['Feedback'].str.title()

Explanation:

Gender: Mapping abbreviations to full terms ensures consistency.
Department: Using .str.title() converts strings like "sales" or "SALES" to
"Sales".
Salary: Removing currency symbols and converting the field to a float makes it
usable for numerical analysis.
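The effect of each transformation is easier to see on a few rows. The values below are hypothetical; the sketch uses .str.replace for the salary column, which is an equivalent alternative to the Series.replace call shown earlier when the column holds strings.

```python
import pandas as pd

# Hypothetical raw values showing each kind of inconsistency.
raw = pd.DataFrame({
    "Gender": ["M", "F", "Male"],
    "Department": ["sales", "SALES", "human resources"],
    "Salary": ["$55,000", "$48,500", "72000"],
})

# Whole-value mapping: "M" becomes "Male", while "Male" is left untouched.
raw["Gender"] = raw["Gender"].replace({"M": "Male", "F": "Female"})
raw["Department"] = raw["Department"].str.title()
# Raw string so the backslash reaches the regex engine unchanged.
raw["Salary"] = raw["Salary"].str.replace(r"[\$,]", "", regex=True).astype(float)

print(raw)
```

One thing to watch for: .str.title() capitalizes only the first letter of each word, so an acronym such as "HR" would become "Hr".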

5. Uniform Date Formats


Convert date strings in the JoinDate column into a uniform datetime format.

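# Convert join dates to datetime; unparseable entries become NaT
# (infer_datetime_format is deprecated as of pandas 2.0, so it is omitted here;
# on pandas 2.x, add format="mixed" if the column mixes several date layouts)
df['JoinDate'] = pd.to_datetime(df['JoinDate'], errors='coerce')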

Explanation:

pd.to_datetime(..., errors='coerce') : Converts valid dates, and invalid formats
become NaT (Not a Time), which can then be checked or imputed further if
needed.
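A short sketch of the coercion behavior, using three hypothetical date strings (two valid, one not):

```python
import pandas as pd

# Hypothetical join dates: two valid ISO strings and one unparseable entry.
dates = pd.Series(["2019-03-15", "2020-07-01", "not recorded"])

parsed = pd.to_datetime(dates, errors="coerce")

print(parsed)
print("Unparseable dates:", parsed.isna().sum())
```

Counting the NaT values after conversion is a quick way to decide whether the failed rows should be dropped, corrected by hand, or imputed.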

6. Final Verification
Inspect the data again to ensure cleaning was successful.

df.info()  # prints directly; wrapping it in print() would also print "None"
print(df.head(10))
Part 2. Extensive Seaborn Lecture
Assuming your data is now clean and stored in the DataFrame df , let’s move to a
detailed lecture on Seaborn. This lecture is divided into multiple sections covering
distribution plots, categorical plots, matrix plots, and advanced customizations.

Note: Before creating any plots, make sure to import the libraries and set your
theme:

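import numpy as np   # used later for np.mean and random sample data
import pandas as pd  # used later to build the sales DataFrame
import seaborn as sns
import matplotlib.pyplot as plt

# Set theme, context, and palette
sns.set_theme(style="whitegrid", context="talk", palette="deep")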

Distribution Plots
Distribution plots help you visualize the distribution of a single variable or the
relationship between two variables.

1. Histograms with KDE (histplot)


A histogram shows the frequency distribution of a variable. When combined with a KDE
(Kernel Density Estimate), it displays a smooth density curve.

Example 1: Histogram with KDE for “Age”

sns.histplot(data=df, x="Age", kde=True, color="skyblue", edgecolor="black", bins=20)
plt.title("Distribution of Age with KDE")
plt.xlabel("Age")
plt.ylabel("Frequency")
plt.show()

Explanation:

bins=20 : Divides the data into 20 intervals.
edgecolor="black" : Outlines the histogram bars.
kde=True : Overlays the density curve.
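To make the meaning of the bin count concrete, the sketch below bins some hypothetical ages with NumPy. It illustrates equal-width binning in general and is not a reproduction of Seaborn's internal code; the data and seed are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)               # fixed seed so the sketch is repeatable
ages = rng.integers(20, 61, size=200)        # hypothetical ages between 20 and 60

counts, edges = np.histogram(ages, bins=20)  # 20 equal-width intervals

print(len(counts), len(edges))               # one count per bin, one more edge than bins
print(counts.sum())                          # every observation lands in exactly one bin
```

Each bar in the histogram corresponds to one entry of counts, and its left and right sides correspond to two neighboring entries of edges.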

Example 2: Basic Histogram without KDE for “Salary”

sns.histplot(data=df, x="Salary", color="salmon", bins=25)
plt.title("Histogram of Salary")
plt.xlabel("Salary")
plt.ylabel("Count")
plt.show()

Explanation:

No KDE overlay here; instead, the focus is solely on the histogram count.
Custom bin count and color improve clarity.
2. Joint Plot
The jointplot function shows the relationship between two variables as a scatter plot
and includes marginal histograms or KDE plots.

Example 1: Scatter Joint Plot for “Salary” vs. “Experience”

sns.jointplot(data=df, x="Salary", y="Experience", kind="scatter", color="purple")
plt.suptitle("Joint Plot: Salary vs. Experience", y=1.02)
plt.show()

Explanation:

Shows a scatter plot of Salary versus Experience with histograms on the
margins.

Example 2: Hex Bin Joint Plot

sns.jointplot(data=df, x="Salary", y="Experience", kind="hex", color="teal")
plt.suptitle("Hex Bin Joint Plot: Salary vs. Experience", y=1.02)
plt.show()

Explanation:

kind="hex" : Uses hexagonal binning to better visualize areas of high density,
especially useful when data points overlap significantly.

3. Pair Plot
The pairplot function builds a grid of plots to display pairwise relationships
between multiple variables. The diagonal typically shows univariate distributions.

Example 1: Pair Plot of Selected Variables

pair_data = df[['Age', 'Salary', 'Experience', 'Rating']].dropna()
sns.pairplot(pair_data, corner=True, diag_kind="hist", palette="coolwarm")
plt.suptitle("Pair Plot of Selected Variables", y=1.02)
plt.show()

Explanation:

corner=True : Displays only the lower triangle of plots to reduce redundancy.
diag_kind="hist" : Puts histograms along the diagonal.

Example 2: Pair Plot with Different Markers and Hue

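# hue must name a column of the DataFrame passed to pairplot, so include Department
hue_data = df[['Age', 'Salary', 'Experience', 'Rating', 'Department']].dropna()
# markers needs one entry per hue level (four departments are assumed here)
sns.pairplot(hue_data, hue="Department", markers=["o", "s", "D", "^"], palette="Set2")
plt.suptitle("Pair Plot with Department Hue", y=1.02)
plt.show()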

Explanation:
Hue: Colors points by Department , adding a categorical dimension.
Markers: Uses different markers for various hue levels.

4. Rug Plot and KDE Plot


A rug plot shows individual data points as small vertical lines along an axis. A KDE
plot shows a smooth estimate of a variable’s density.

Example 1: Rug Plot for “Age”

sns.rugplot(data=df, x="Age", color="darkred")
plt.title("Rug Plot of Age")
plt.xlabel("Age")
plt.show()

Explanation:

Visualizes individual data points along the x‑axis as tick marks.

Example 2: Standalone KDE Plot for “Age”

sns.kdeplot(data=df, x="Age", fill=True, color="green", bw_adjust=1.5)
plt.title("KDE Plot for Age")
plt.xlabel("Age")
plt.ylabel("Density")
plt.show()

Explanation:

fill=True : Fills the area under the density curve.
bw_adjust=1.5 : Scales the KDE bandwidth; values above 1 give a smoother curve, values below 1 a more detailed one.

Categorical Data Plots


Categorical plots are useful for comparing distributions or aggregate statistics
across different groups.

1. Catplot
The catplot function (formerly known as factorplot ) is a figure-level function that
supports multiple plot types through the kind parameter.

Example 1: Box Plot via Catplot for “Salary” by “Department”

sns.catplot(data=df, x="Department", y="Salary", kind="box", height=5, aspect=1.5,
            palette="pastel")
plt.title("Box Plot of Salary by Department")
plt.xlabel("Department")
plt.ylabel("Salary")
plt.show()

Explanation:

kind="box" : Creates a box plot using the categorical plot interface.
height and aspect : Adjust the size and aspect ratio.
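Because catplot is figure-level, it creates its own figure and returns a FacetGrid, so the figure size comes from height and aspect rather than from plt.figure(figsize=...). The sketch below shows this on hypothetical data; the DataFrame name demo and its values are assumptions, and the Agg backend is selected only so the code runs without a display. The g.figure attribute is available in recent Seaborn releases (older ones expose it as g.fig).

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import pandas as pd
import seaborn as sns

# Hypothetical salaries for two departments.
demo = pd.DataFrame({
    "Department": ["Sales", "Sales", "HR", "HR", "Sales", "HR"],
    "Salary": [50000, 55000, 48000, 52000, 61000, 45000],
})

g = sns.catplot(data=demo, x="Department", y="Salary", kind="box",
                height=5, aspect=1.5)

# catplot returns a FacetGrid, so labels and titles can be set on it directly.
g.set_axis_labels("Department", "Salary")
g.figure.suptitle("Box Plot of Salary by Department", y=1.02)

width, height = g.figure.get_size_inches()
print(width, height)  # for a single facet, width is height * aspect
```

Setting labels through the returned grid is a safer habit than plt.title once a plot has several facets, since plt.title only affects the most recently drawn axes.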

Example 2: Violin Plot via Catplot for “Experience” by “Department”

sns.catplot(data=df, x="Department", y="Experience", kind="violin", height=5,
            aspect=1.5, palette="muted")
plt.title("Violin Plot of Experience by Department")
plt.xlabel("Department")
plt.ylabel("Years of Experience")
plt.show()

Explanation:

kind="violin" : Displays a combination of box plot statistics with a KDE for
density estimation.

2. Box Plot
A box plot details the median, quartiles, and outliers of a variable across
categories.
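The quantities a box plot draws can be computed by hand. The sketch below uses hypothetical salary figures and the common 1.5 * IQR rule, which is also Seaborn's default for whisker length:

```python
import pandas as pd

# Hypothetical salaries (in thousands) with one unusually high value.
salaries = pd.Series([40, 45, 50, 55, 60, 65, 150])

q1 = salaries.quantile(0.25)
median = salaries.median()
q3 = salaries.quantile(0.75)
iqr = q3 - q1

# Whiskers extend to the furthest observation within 1.5 * IQR of the quartiles;
# anything beyond these fences is drawn as an individual outlier point.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = salaries[(salaries < lower) | (salaries > upper)]

print(q1, median, q3)
print(outliers.tolist())
```

In this sample the box spans 47.5 to 62.5 with the median line at 55, and only the value 150 falls outside the fences.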

Example 1: Basic Box Plot for “Salary” by “Department”

sns.boxplot(data=df, x="Department", y="Salary", palette="Set3")
plt.title("Box Plot of Salary by Department")
plt.xlabel("Department")
plt.ylabel("Salary")
plt.show()

Explanation:

Offers a straightforward view of salary distribution across departments.

Example 2: Box Plot with Hue

sns.boxplot(data=df, x="Department", y="Salary", hue="Gender", palette="coolwarm")
plt.title("Box Plot of Salary by Department and Gender")
plt.xlabel("Department")
plt.ylabel("Salary")
plt.legend(title="Gender")
plt.show()

Explanation:

Hue: Splits the box plot by Gender so you can compare subgroups within each
department.

3. Violin Plot
Violin plots display the distribution’s density in addition to summary statistics
similar to box plots.

Example 1: Basic Violin Plot for “Experience” by “Department”

sns.violinplot(data=df, x="Department", y="Experience", palette="muted")
plt.title("Violin Plot of Experience by Department")
plt.xlabel("Department")
plt.ylabel("Years of Experience")
plt.show()

Explanation:

Combines box plot features with a KDE to show full distribution profiles.

Example 2: Split Violin Plot by “Gender”

sns.violinplot(data=df, x="Department", y="Experience", hue="Gender", split=True,
               palette="Pastel1")
plt.title("Split Violin Plot of Experience by Department and Gender")
plt.xlabel("Department")
plt.ylabel("Years of Experience")
plt.legend(title="Gender")
plt.show()

Explanation:

split=True : Divides the violin plot into two halves (one per hue level),
making subgroup comparisons clearer.

4. Strip Plot and Swarm Plot


These plots display individual data points and are ideal for showing distributions
without aggregation.

Strip Plot

Example 1: Basic Strip Plot for “Rating” by “Feedback”

sns.stripplot(data=df, x="Feedback", y="Rating", jitter=True, palette="cool")
plt.title("Strip Plot of Rating by Feedback")
plt.xlabel("Feedback")
plt.ylabel("Rating")
plt.show()

Explanation:

jitter=True : Adds horizontal noise to prevent overlapping points.

Example 2: Strip Plot with Adjusted Marker Size

sns.stripplot(data=df, x="Feedback", y="Rating", jitter=True, size=8, color="navy")
plt.title("Strip Plot with Larger Markers")
plt.xlabel("Feedback")
plt.ylabel("Rating")
plt.show()

Explanation:

Adjusting the marker size makes individual observations more visible.

Swarm Plot

Example 1: Basic Swarm Plot for “Rating” by “Feedback”

sns.swarmplot(data=df, x="Feedback", y="Rating", palette="deep")
plt.title("Swarm Plot of Rating by Feedback")
plt.xlabel("Feedback")
plt.ylabel("Rating")
plt.show()

Explanation:

Automatically arranges points to avoid overlaps without random jitter.

Example 2: Swarm Plot with a Different Palette

sns.swarmplot(data=df, x="Feedback", y="Rating", palette="viridis")
plt.title("Swarm Plot with Viridis Palette")
plt.xlabel("Feedback")
plt.ylabel("Rating")
plt.show()

Explanation:

Demonstrates the use of an alternative color palette.

5. Bar Plot and Count Plot


Bar plots aggregate statistics across categories, while count plots display the
frequency of each category.

Bar Plot

Example 1: Bar Plot of Mean “Salary” by “Department”

sns.barplot(data=df, x="Department", y="Salary", estimator=np.mean, palette="rocket")
plt.title("Average Salary by Department")
plt.xlabel("Department")
plt.ylabel("Average Salary")
plt.show()

Explanation:

Estimator set to np.mean : Computes the average salary per department.

Example 2: Bar Plot with Confidence Intervals


sns.barplot(data=df, x="Department", y="Rating", palette="coolwarm")
plt.title("Average Rating by Department with Confidence Intervals")
plt.xlabel("Department")
plt.ylabel("Average Rating")
plt.show()

Explanation:

Confidence intervals (95% by default, drawn as error bars on each bar) help
indicate the reliability of the aggregated statistic.
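Seaborn estimates these intervals by bootstrapping. The sketch below is a rough illustration of that idea with NumPy, not Seaborn's internal code; the ratings, the seed, and the number of resamples are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(42)  # fixed seed so the sketch is repeatable
ratings = np.array([3.0, 3.5, 4.0, 4.5, 2.5, 4.0, 3.5, 5.0, 3.0, 4.0])  # hypothetical

# Resample with replacement many times and record the mean of each resample.
boot_means = np.array([
    rng.choice(ratings, size=ratings.size, replace=True).mean()
    for _ in range(1000)
])

# The middle 95% of the resampled means gives a percentile confidence interval.
low, high = np.percentile(boot_means, [2.5, 97.5])
print(ratings.mean(), low, high)
```

A narrow interval suggests the group mean is well determined by the data, while a wide one usually points to few observations or high variability within the group.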

Count Plot

Example 1: Count Plot for “Department”

sns.countplot(data=df, x="Department", palette="viridis")
plt.title("Count of Records by Department")
plt.xlabel("Department")
plt.ylabel("Count")
plt.show()

Example 2: Count Plot for “Feedback”

sns.countplot(data=df, x="Feedback", palette="Set2")
plt.title("Count of Records by Feedback")
plt.xlabel("Feedback")
plt.ylabel("Count")
plt.show()

Explanation:

Count plots simply show how many observations belong to each category.
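The numbers behind a count plot are the same ones pandas reports with value_counts. A small sketch on hypothetical feedback labels:

```python
import pandas as pd

# Hypothetical feedback labels.
feedback = pd.Series(["Satisfied", "Neutral", "Satisfied", "Dissatisfied", "Satisfied"])

counts = feedback.value_counts()  # one entry per category, most frequent first
print(counts)
```

Checking these counts before plotting is a quick way to spot categories that differ only by spelling or case.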

Matrix Plots: Heatmap


Heatmaps are effective for visualizing matrix-like data, such as correlation matrices
or pivot tables.

Example 1: Correlation Matrix Heatmap

# Compute correlation matrix for numeric features (dropping any remaining missing values)
corr_matrix = df[['Age', 'Salary', 'Experience', 'Rating']].dropna().corr()

sns.heatmap(corr_matrix, annot=True, cmap="coolwarm", linewidths=0.5,
            linecolor="white")
plt.title("Heatmap of Correlation Matrix")
plt.show()

Explanation:

annot=True : Displays the correlation values in each cell.
linewidths=0.5 and linecolor="white" : Add thin white borders between cells for clarity.

Example 2: Pivot Table Heatmap

# Create a synthetic pivot table from sample sales data
sales_data = {
    'Store': np.random.choice(['Store A', 'Store B', 'Store C'], 100),
    'Product': np.random.choice(['Product 1', 'Product 2', 'Product 3'], 100),
    'Sales': np.random.randint(100, 500, 100)
}
sales_df = pd.DataFrame(sales_data)
pivot_table = sales_df.pivot_table(index='Store', columns='Product', values='Sales',
                                   aggfunc='sum')

sns.heatmap(pivot_table, annot=True, fmt="d", cmap="YlGnBu", cbar=True)
plt.title("Sales Heatmap by Store and Product")
plt.show()

Explanation:

Uses a pivot table format to summarize sales and visualize the totals via a
heatmap.
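Since the example above uses random data, the pivot step can be hard to follow. The sketch below repeats it on five hypothetical rows, small enough to verify by hand:

```python
import pandas as pd

# Hypothetical sales records; a (Store, Product) pair may appear more than once.
sales = pd.DataFrame({
    "Store": ["Store A", "Store A", "Store B", "Store B", "Store A"],
    "Product": ["Product 1", "Product 2", "Product 1", "Product 2", "Product 1"],
    "Sales": [100, 200, 150, 300, 50],
})

# Rows become stores, columns become products, and repeated pairs are summed.
table = sales.pivot_table(index="Store", columns="Product",
                          values="Sales", aggfunc="sum")
print(table)
```

Store A sold Product 1 twice (100 and 50), so that cell holds 150. Each cell of this table becomes one colored square in the heatmap.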

Advanced Customizations and Settings


Seaborn’s flexibility allows you to customize nearly every aspect of your plots. Here
are some commonly used customizations:

Figure Size:
Adjust the overall figure size with Matplotlib:

plt.figure(figsize=(10, 6))

Titles and Labels:


Customize with:

plt.title("Your Title")
plt.xlabel("X-axis Label")
plt.ylabel("Y-axis Label")

Legend Options:
Use the legend function with a title:

plt.legend(title="Your Title")

Custom Color Palettes:


Generate or set custom palettes:

custom_palette = sns.color_palette("husl", 8)
# Alternatively, use:
sns.light_palette("seagreen")
sns.dark_palette("navy")
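A generated palette is a list of RGB tuples that can be inspected or installed as the default color cycle. A minimal sketch, assuming a reasonably recent Seaborn release:

```python
import seaborn as sns

custom_palette = sns.color_palette("husl", 8)  # eight evenly spaced hues
print(len(custom_palette))
print(custom_palette.as_hex()[:3])             # first three colors as hex strings

sns.set_palette(custom_palette)                # make it the default color cycle
```

The same object can also be passed to the palette argument of an individual plotting call instead of being set globally.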
Context Settings:
Change scale for larger visuals:

sns.set_context("poster") # Larger labels and thicker lines for presentations
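To see what a context actually changes, the scaled parameter values can be compared without modifying the active settings. The three keys inspected below are a small selection chosen for illustration:

```python
import seaborn as sns

# plotting_context returns the parameter values a context would apply,
# without changing the currently active settings.
paper = sns.plotting_context("paper")
poster = sns.plotting_context("poster")

for key in ("font.size", "axes.labelsize", "lines.linewidth"):
    print(key, paper[key], poster[key])
```

Every value is larger under "poster" than under "paper", which is why the same plot looks bolder on a slide than in a printed report.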

Saving Figures:
Save your figures with a high resolution:

plt.savefig("your_plot.png", dpi=300, bbox_inches="tight")

Inline Plots in Notebooks:
If using a Jupyter Notebook, display figures as static images below each cell with:

%matplotlib inline

Summary
This extensive lecture on Seaborn has covered the following:

1. Data Cleaning:

Inspection of data structure and quality.
Removing duplicates.
Imputing missing values.
Standardizing inconsistent data.
Uniformly converting dates.

2. Seaborn Introduction & Setup:

How to set themes, contexts, and custom color palettes.

3. Distribution Plots:

Histograms (with and without KDE).
Joint plots (scatter and hex-bin).
Pair plots (with hue and different markers).
Rug and standalone KDE plots.

4. Categorical Plots:

Catplot for box and violin plots.
Standalone box plots (with and without hue).
Violin plots (including split variations).
Strip and swarm plots (detailed with adjustments).
Bar plots and count plots for aggregated insights.

5. Matrix Plots:

Heatmaps for correlation and pivot table data visualization.

6. Advanced Customizations:

Adjusting figure sizes, titles, labels, legend options, and saving
figures.
