
Step-by-Step Data Cleaning in Python for movies.csv

This document outlines a step-by-step process for cleaning a movie dataset (movies.csv)
using Python with Pandas and NumPy. Each step includes a description and
Python code.

1 Load and Inspect the Dataset

Loading the dataset and inspecting its structure helps identify issues like missing
values or incorrect data types.
import pandas as pd

# Load movies.csv
dataset = pd.read_csv('movies.csv')

# Inspect the dataset
print(dataset.head())      # First 5 rows
dataset.info()             # Data types and missing values
print(dataset.describe())  # Summary statistics

2 Handle Missing Values

Missing values (NaN/None) can skew analysis. Drop rows with missing critical
columns (e.g., title) and impute numerical columns (e.g., rating) with the
mean.
# Check missing values
print(dataset.isnull().sum())

# Drop rows with missing 'title'
dataset.dropna(subset=['title'], inplace=True)

# Fill missing 'rating' with the mean
dataset['rating'] = dataset['rating'].fillna(dataset['rating'].mean())
print(dataset.isnull().sum())
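If ratings are heavily skewed, the mean gets pulled toward extreme values, and a
median-based variant is a common alternative. This is an optional sketch, not one
of the original steps:

# Optional: fill missing 'rating' with the median, which is robust to outliers
dataset['rating'] = dataset['rating'].fillna(dataset['rating'].median())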

1
3 Remove Duplicates

Duplicates can bias results. Remove duplicate rows based on key columns like
title and year.
# Check for duplicates
print(dataset.duplicated().sum())

# Remove duplicates
dataset.drop_duplicates(subset=['title', 'year'], inplace=True)
print(dataset.shape)
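By default, drop_duplicates keeps the first occurrence of each (title, year) pair.
If the file happens to be ordered so that later rows are more current, the keep
parameter can be flipped; a sketch of that variant:

# Optional: keep the last occurrence of each (title, year) pair instead
dataset.drop_duplicates(subset=['title', 'year'], keep='last', inplace=True)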

4 Correct Data Types

Ensure columns have appropriate types (e.g., year as integer, release_date as datetime).
# Convert 'year' to a nullable integer
dataset['year'] = dataset['year'].astype('Int64')

# Convert 'release_date' to datetime
dataset['release_date'] = pd.to_datetime(dataset['release_date'], errors='coerce')
print(dataset.dtypes)
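The errors='coerce' argument turns unparseable values into NaT instead of raising
an exception. A minimal standalone illustration with toy values (not taken from
movies.csv):

# Unparseable entries become NaT (not-a-time) rather than raising
print(pd.to_datetime(['2001-05-16', 'not a date'], errors='coerce'))
# DatetimeIndex(['2001-05-16', 'NaT'], dtype='datetime64[ns]', freq=None)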

5 Handle Inconsistent Data

Standardize text data (e.g., genre) to fix typos or case inconsistencies.


# Standardize 'genre': lowercase and strip surrounding whitespace
dataset['genre'] = dataset['genre'].str.lower().str.strip()

# Fix inconsistent values
dataset['genre'] = dataset['genre'].replace({'scifi': 'sci-fi', 'comedy ': 'comedy'})
print(dataset['genre'].value_counts())

6 Handle Outliers

Detect and remove outliers in numerical columns (e.g., rating) using the
Interquartile Range (IQR) method. For example, if Q1 = 5.0 and Q3 = 7.0, then
IQR = 2.0 and any rating outside [2.0, 10.0] is treated as an outlier.
import numpy as np

# Calculate IQR for 'rating'
Q1, Q3 = dataset['rating'].quantile([0.25, 0.75])
IQR = Q3 - Q1
lower_bound, upper_bound = Q1 - 1.5 * IQR, Q3 + 1.5 * IQR

# Remove outliers
dataset = dataset[(dataset['rating'] >= lower_bound) & (dataset['rating'] <= upper_bound)]
print(dataset['rating'].describe())

7 Encode Categorical Variables

Convert categorical columns (e.g., genre) to numerical format using one-hot
encoding.
# One-hot encode 'genre'
dataset = pd.get_dummies(dataset, columns=['genre'], prefix='genre')
print(dataset.head())
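To see what get_dummies actually produces, here is a toy example with made-up
genres (independent of movies.csv); each distinct value becomes its own
indicator column:

# Toy frame: three rows, two distinct genres
toy = pd.DataFrame({'genre': ['comedy', 'sci-fi', 'comedy']})
# Result has columns genre_comedy and genre_sci-fi (booleans in recent
# pandas versions, 0/1 integers in older ones)
print(pd.get_dummies(toy, columns=['genre'], prefix='genre'))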

8 Clean Text Data

Remove unwanted characters from text columns like title using regular
expressions.
import re

# Clean 'title': remove characters that are not letters, digits, or spaces
dataset['title'] = dataset['title'].apply(lambda x: re.sub(r'[^a-zA-Z0-9\s]', '', str(x)))
print(dataset['title'].head())
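The same cleanup can be done without apply via pandas' vectorized string
methods, which are usually faster on large frames; an equivalent sketch:

# Vectorized equivalent of the re.sub call above
dataset['title'] = dataset['title'].astype(str).str.replace(r'[^a-zA-Z0-9\s]', '', regex=True)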

9 Filter Irrelevant Data

Remove irrelevant rows (e.g., movies before 2000) or columns (e.g., comments).
# Keep movies from 2000 or later
dataset = dataset[dataset['year'] >= 2000]

# Drop irrelevant column
dataset.drop(columns=['comments'], inplace=True, errors='ignore')
print(dataset.head())

10 Validate and Save

Verify the cleaned dataset and save it to a new CSV file.


# Final inspection
dataset.info()
print(dataset.isnull().sum())

# Save cleaned dataset
dataset.to_csv('movies_cleaned.csv', index=False)
print("Cleaned dataset saved as 'movies_cleaned.csv'")
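As an extra sanity check (not in the original steps), the saved file can be
read back to confirm it round-trips:

# Reload the cleaned file and confirm the row/column counts look right
check = pd.read_csv('movies_cleaned.csv')
print(check.shape)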

11 Notes

• Ensure movies.csv is in the working directory or provide the full path.
• Use os.getcwd() to check the working directory (see the snippet after
these notes).
• Adjust steps based on specific dataset issues (e.g., unique columns or
formats).
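A minimal snippet for the working-directory check mentioned above:

import os

# Print the directory that relative paths like 'movies.csv' resolve against
print(os.getcwd())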
