Step-by-Step Data Cleaning in Python for movies.csv

This document outlines a step-by-step process for cleaning a movie dataset (movies.csv) using Python with Pandas and NumPy. Each step includes a description and Python code.
1 Load and Inspect the Dataset

Loading the dataset and inspecting its structure helps identify issues like missing values or incorrect data types.

import pandas as pd

# Load movies.csv
dataset = pd.read_csv('movies.csv')

# Inspect the dataset
print(dataset.head())      # First 5 rows
dataset.info()             # Data types and missing values
print(dataset.describe())  # Summary statistics
2 Handle Missing Values

Missing values (NaN/None) can skew analysis. Drop rows with missing critical columns (e.g., title) and impute numerical columns (e.g., rating) with the mean.

# Check missing values
print(dataset.isnull().sum())

# Drop rows with missing 'title'
dataset.dropna(subset=['title'], inplace=True)

# Fill missing 'rating' with the mean; assignment avoids the
# chained-assignment pitfall of inplace fillna on a column
dataset['rating'] = dataset['rating'].fillna(dataset['rating'].mean())
print(dataset.isnull().sum())
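Mean imputation is sensitive to extreme values; if the rating column is skewed, the median is often a safer default. A minimal sketch on invented data (the column name mirrors the dataset above; the values are made up):

```python
import pandas as pd

# Hypothetical ratings with one missing value and one extreme outlier
df = pd.DataFrame({'rating': [7.0, 8.0, None, 6.0, 100.0]})

# Mean imputation is pulled up by the outlier ...
mean_filled = df['rating'].fillna(df['rating'].mean())

# ... while median imputation stays near the typical values
median_filled = df['rating'].fillna(df['rating'].median())

print(mean_filled.tolist())    # the NaN becomes 30.25
print(median_filled.tolist())  # the NaN becomes 7.5
```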
3 Remove Duplicates

Duplicates can bias results. Remove duplicate rows based on key columns like title and year.

# Check for duplicates
print(dataset.duplicated(subset=['title', 'year']).sum())

# Remove duplicates
dataset.drop_duplicates(subset=['title', 'year'], inplace=True)
print(dataset.shape)
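By default drop_duplicates keeps the first occurrence of each duplicate group; keep='last' or keep=False changes that. A quick illustration on invented rows with the same title/year columns as above:

```python
import pandas as pd

df = pd.DataFrame({
    'title':  ['Inception', 'Inception', 'Arrival'],
    'year':   [2010, 2010, 2016],
    'rating': [8.8, 8.7, 7.9],
})

# keep='first' (the default) retains one row per duplicate group
print(df.drop_duplicates(subset=['title', 'year']).shape)              # (2, 3)

# keep=False drops every member of a duplicate group
print(df.drop_duplicates(subset=['title', 'year'], keep=False).shape)  # (1, 3)
```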
4 Correct Data Types

Ensure columns have appropriate types (e.g., year as integer, release_date as datetime).

# Convert 'year' to pandas' nullable integer type ('Int64' tolerates NaN)
dataset['year'] = dataset['year'].astype('Int64')

# Convert 'release_date' to datetime; unparseable values become NaT
dataset['release_date'] = pd.to_datetime(dataset['release_date'], errors='coerce')
print(dataset.dtypes)
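With errors='coerce', any string that cannot be parsed becomes NaT instead of raising, which is why it is worth re-counting nulls after the conversion. A quick check with made-up date strings:

```python
import pandas as pd

dates = pd.Series(['2001-05-12', 'not a date', '1999-01-01'])
parsed = pd.to_datetime(dates, errors='coerce')

print(parsed.isna().sum())  # 1 -- the unparseable entry became NaT
```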
5 Handle Inconsistent Data

Standardize text data (e.g., genre) to fix typos or case inconsistencies.

# Standardize 'genre'
dataset['genre'] = dataset['genre'].str.lower().str.strip()

# Fix inconsistent values
dataset['genre'] = dataset['genre'].replace({'scifi': 'sci-fi', 'comedy ': 'comedy'})
print(dataset['genre'].value_counts())
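Series.replace only fixes the exact strings listed, so lowercasing and stripping first reduces the number of variants the mapping has to cover. A hedged sketch with invented genre spellings:

```python
import pandas as pd

genres = pd.Series(['Sci-Fi ', 'scifi', 'COMEDY', 'comedy'])

# Normalize case and whitespace first, then collapse known variants
cleaned = genres.str.lower().str.strip()
fixes = {'scifi': 'sci-fi'}  # extend this dict as new variants appear
cleaned = cleaned.replace(fixes)

print(cleaned.value_counts().to_dict())  # {'sci-fi': 2, 'comedy': 2}
```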
6 Handle Outliers

Detect and remove outliers in numerical columns (e.g., rating) using the interquartile range (IQR) method.

import numpy as np

# Calculate IQR for 'rating'
Q1, Q3 = dataset['rating'].quantile([0.25, 0.75])
IQR = Q3 - Q1
lower_bound, upper_bound = Q1 - 1.5 * IQR, Q3 + 1.5 * IQR

# Remove outliers
dataset = dataset[(dataset['rating'] >= lower_bound) & (dataset['rating'] <= upper_bound)]
print(dataset['rating'].describe())
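The IQR arithmetic is easy to verify by hand on a tiny sample. With the invented ratings below, Q1 = 6.0 and Q3 = 8.0, so the fences sit at 3.0 and 11.0 and only the planted extreme value falls outside:

```python
import pandas as pd

ratings = pd.Series([4.0, 6.0, 6.0, 8.0, 8.0, 42.0])  # 42 is the planted outlier

q1, q3 = ratings.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

kept = ratings[(ratings >= lower) & (ratings <= upper)]
print(lower, upper)   # 3.0 11.0
print(kept.tolist())  # [4.0, 6.0, 6.0, 8.0, 8.0]
```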
7 Encode Categorical Variables

Convert categorical columns (e.g., genre) to numerical format using one-hot encoding.

# One-hot encode 'genre'
dataset = pd.get_dummies(dataset, columns=['genre'], prefix='genre')
print(dataset.head())
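pd.get_dummies replaces the genre column with one indicator column per category. On a tiny invented frame (indicator dtype is boolean in recent pandas, integer in older versions):

```python
import pandas as pd

df = pd.DataFrame({'title': ['A', 'B', 'C'],
                   'genre': ['comedy', 'sci-fi', 'comedy']})
encoded = pd.get_dummies(df, columns=['genre'], prefix='genre')

print(list(encoded.columns))                         # ['title', 'genre_comedy', 'genre_sci-fi']
print(encoded['genre_comedy'].astype(int).tolist())  # [1, 0, 1]
```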
8 Clean Text Data

Remove unwanted characters from text columns like title using regular expressions.

import re

# Clean 'title': keep only letters, digits, and whitespace
dataset['title'] = dataset['title'].apply(lambda x: re.sub(r'[^a-zA-Z0-9\s]', '', str(x)))
print(dataset['title'].head())
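The pattern [^a-zA-Z0-9\s] strips everything except ASCII letters, digits, and whitespace, so punctuation and non-ASCII characters disappear too. A quick check on invented titles:

```python
import re

titles = ['WALL·E', 'Spider-Man: No Way Home', 'Se7en']
cleaned = [re.sub(r'[^a-zA-Z0-9\s]', '', t) for t in titles]

print(cleaned)  # ['WALLE', 'SpiderMan No Way Home', 'Se7en']
```

Note that accented letters are removed as well, which may be undesirable for non-English titles; a wider character class would be needed to keep them.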
9 Filter Irrelevant Data

Remove irrelevant rows (e.g., movies released before 2000) or columns (e.g., comments).

# Keep movies from 2000 or later
dataset = dataset[dataset['year'] >= 2000]

# Drop irrelevant column
dataset.drop(columns=['comments'], inplace=True, errors='ignore')
print(dataset.head())
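Passing errors='ignore' to drop makes the call a no-op when the column is already absent, so the script does not fail on datasets without a comments column. A small demonstration on an invented frame:

```python
import pandas as pd

df = pd.DataFrame({'title': ['A'], 'year': [2005]})

# No 'comments' column here, but errors='ignore' prevents a KeyError
df.drop(columns=['comments'], inplace=True, errors='ignore')

print(list(df.columns))  # ['title', 'year']
```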
10 Validate and Save

Verify the cleaned dataset and save it to a new CSV file.

# Final inspection
dataset.info()
print(dataset.isnull().sum())

# Save cleaned dataset
dataset.to_csv('movies_cleaned.csv', index=False)
print("Cleaned dataset saved as 'movies_cleaned.csv'")
11 Notes

• Ensure movies.csv is in the working directory or provide the full path.
• Use os.getcwd() to check the working directory.
• Adjust steps based on specific dataset issues (e.g., unique columns or formats).