1. Handling Missing Values
Operation: Identify columns with missing values and assess the extent of missingness.
Python Functions:
# Checking for missing values
df.isnull().sum()
# Fill missing values with the column median
df['mean_cpu_usage_rate'] = df['mean_cpu_usage_rate'].fillna(df['mean_cpu_usage_rate'].median())
# Drop rows with too many missing values
df = df.dropna(axis=0, thresh=5)  # Keep only rows with at least 5 non-NaN values
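To assess the extent of missingness, it also helps to look at the share of missing values per column and drop columns that exceed a chosen cutoff. A minimal sketch, assuming a 50% cutoff is acceptable for this analysis:
# Percentage of missing values per column
missing_pct = df.isnull().mean() * 100
print(missing_pct.sort_values(ascending=False))
# Drop columns where more than 50% of the values are missing (the 0.5 threshold is an assumption)
df = df.loc[:, df.isnull().mean() <= 0.5]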
2. Removing Duplicate Entries
Operation: Check for and remove duplicate rows.
Python Functions:
# Identifying duplicates
duplicates = df[df.duplicated()]
# Removing duplicates
df.drop_duplicates(inplace=True)
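Full-row duplicates are not the only kind worth checking: records can repeat on their key fields even when a derived column differs slightly. A minimal sketch, assuming hypothetical key columns such as job_id, task_index, and start_time exist in the table:
# Duplicates judged only on key columns (the column names are assumptions)
key_cols = ['job_id', 'task_index', 'start_time']
key_duplicates = df[df.duplicated(subset=key_cols, keep=False)]
# Keep the first occurrence of each key combination
df = df.drop_duplicates(subset=key_cols, keep='first')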
3. Correcting Data Types
Operation: Ensure that columns have the correct data types.
Python Functions:
# Convert column to float
df['mean_cpu_usage_rate'] = df['mean_cpu_usage_rate'].astype(float)
# Convert to datetime
df['start_time'] = pd.to_datetime(df['start_time'])
df['end_time'] = pd.to_datetime(df['end_time'])
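If the raw timestamp columns are integer offsets rather than datetime strings (the Google cluster trace records times as microsecond offsets), pd.to_datetime needs the unit spelled out. A minimal sketch, assuming microsecond-resolution integers:
# Convert integer offsets to datetimes (unit='us' is an assumption about the raw data)
df['start_time'] = pd.to_datetime(df['start_time'], unit='us')
df['end_time'] = pd.to_datetime(df['end_time'], unit='us')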
4. Filtering Outliers
Operation: Detect and manage outliers using statistical techniques.
Python Functions:
# Using Z-score to identify outliers
from scipy.stats import zscore
df['zscore'] = zscore(df['mean_cpu_usage_rate'])
outliers = df[(df['zscore'] < -3) | (df['zscore'] > 3)]
# Remove outliers and drop the helper column
df = df[(df['zscore'] >= -3) & (df['zscore'] <= 3)].drop(columns='zscore')
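The Z-score rule works best when the data are roughly normal; for skewed usage metrics an interquartile-range rule is a common alternative. A minimal sketch of IQR-based filtering on the same column:
# IQR-based outlier filter (the 1.5 multiplier is a conventional choice, not a requirement)
q1 = df['mean_cpu_usage_rate'].quantile(0.25)
q3 = df['mean_cpu_usage_rate'].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
df = df[df['mean_cpu_usage_rate'].between(lower, upper)]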
5. Standardizing Units and Scales
Operation: Ensure all measurements are in consistent units and scales.
Python Functions:
# Convert bytes to megabytes
df['assigned_memory_usage_MB'] = df['assigned_memory_usage'] / (1024 * 1024)
# Normalize or scale data
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
df[['mean_cpu_usage_rate', 'assigned_memory_usage_MB']] = scaler.fit_transform(
    df[['mean_cpu_usage_rate', 'assigned_memory_usage_MB']]
)
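Whether a unit conversion is needed at all depends on how the trace was exported (memory may already be normalized rather than stored in bytes), so it is worth inspecting value ranges before committing to a conversion or scaler. A minimal sketch:
# Inspect value ranges to sanity-check the assumed units
print(df[['mean_cpu_usage_rate', 'assigned_memory_usage']].describe())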
6. Handling Inconsistent Entries
Operation: Clean up inconsistencies in the data.
Python Functions:
# Standardize inconsistent labels: lower-case first, then map known variants
df['aggregation_type'] = (
    df['aggregation_type'].str.strip().str.lower().replace({'summation': 'sum'})
)
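Before settling on a mapping, it helps to see which labels actually occur so that variants and typos are not missed. A quick check:
# List the distinct labels and how often each occurs
print(df['aggregation_type'].value_counts(dropna=False))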
7. Correcting Timestamp Misalignments
Operation: Ensure proper alignment of `start_time` and `end_time`.
Python Functions:
# Find rows where end_time is before start_time
misaligned = df[df['end_time'] < df['start_time']]
# Fix or drop these rows as necessary
df = df[df['end_time'] >= df['start_time']]
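A related sanity check is to compute each record's duration and confirm it is positive, since zero-length intervals usually signal logging problems. A minimal sketch; the duration_s column name is just for illustration:
# Compute the duration of each measurement window and flag non-positive ones
df['duration_s'] = (df['end_time'] - df['start_time']).dt.total_seconds()
suspect = df[df['duration_s'] <= 0]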
8. Removing Irrelevant Columns
Operation: Drop columns that are not needed for analysis.
Python Functions:
# Drop unnecessary columns
df.drop(['sample_portion', 'aggregation_type'], axis=1, inplace=True)
9. Consistent Handling of Zero or Negative Values
Operation: Identify and handle zero or negative values appropriately.
Python Functions:
# Replace zero or negative values with NaN, then impute with the median
df['mean_cpu_usage_rate'] = df['mean_cpu_usage_rate'].mask(df['mean_cpu_usage_rate'] <= 0)
df['mean_cpu_usage_rate'] = df['mean_cpu_usage_rate'].fillna(df['mean_cpu_usage_rate'].median())
10. Data Sampling and Reduction
Operation: Reduce dataset size without losing critical information.
Python Functions:
# Random sampling of data
sampled_df = df.sample(frac=0.1, random_state=42) # Take 10% sample
# Aggregating data to hourly means
df['hourly_time'] = df['start_time'].dt.floor('H')
aggregated_df = df.groupby('hourly_time').agg({
'mean_cpu_usage_rate': 'mean',
'assigned_memory_usage_MB': 'sum'
}).reset_index()
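If a plain random sample risks under-representing small groups, sampling within groups keeps every category present. A minimal sketch, assuming a hypothetical job_id column to group by:
# Sample 10% of the rows within each group (job_id is an assumed column name)
stratified_df = df.groupby('job_id').sample(frac=0.1, random_state=42)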
By following these steps and applying the corresponding Python functions, you can effectively clean the Google Cluster Dataset, preparing it for further analysis and ensuring that the insights you derive are reliable and accurate.