1. Data Cleaning
Handle missing values
Remove rows/columns with too many missing values
Fill (impute) missing values (mean, median, mode, interpolation, etc.)
Fix data types (e.g., convert strings to dates, integers to floats)
Remove or correct outliers
Detect them with z-scores, the IQR method, or visual tools such as boxplots
Standardize categorical values (e.g., fix typos in category labels)
Drop irrelevant features that won’t contribute to model performance
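The cleaning steps above can be sketched with pandas. This is a minimal illustration, not a prescribed recipe: the DataFrame, its column names (age, city, signup, notes), the 50% missing-value threshold, and the 1.5 x IQR cutoff are all assumptions chosen for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; every column name and value is illustrative only.
df = pd.DataFrame({
    "age": [25, 32, np.nan, 41, 29, 350],   # one missing value, one implausible value
    "city": ["Paris", "paris ", "London", "Lodnon", "Paris", "London"],
    "signup": ["2023-01-05", "2023-02-11", "2023-02-30",
               "2023-03-01", "2023-03-15", "2023-04-02"],
    "notes": [np.nan, np.nan, np.nan, np.nan, "vip", np.nan],
})

# Drop columns in which more than half of the values are missing
# (the 0.5 threshold is an assumption; tune it per dataset).
df = df.loc[:, df.isna().mean() <= 0.5].copy()

# Impute the remaining numeric gap with the median, which is less
# sensitive to the extreme value than the mean would be.
df["age"] = df["age"].fillna(df["age"].median())

# Fix data types: parse strings as dates; unparseable dates become NaT.
df["signup"] = pd.to_datetime(df["signup"], errors="coerce")

# Remove outliers with the IQR rule (keep values within 1.5 * IQR of the quartiles).
q1, q3 = df["age"].quantile([0.25, 0.75])
iqr = q3 - q1
in_range = df["age"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
df = df[in_range].reset_index(drop=True)

# Standardize category labels: trim whitespace, normalize case, map known typos.
df["city"] = df["city"].str.strip().str.title().replace({"Lodnon": "London"})

print(df)
```

Whether to drop, cap, or correct an outlier depends on the domain; the sketch simply removes the row to keep the example short.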
2. Feature Engineering
Encoding categorical variables
One-hot encoding
Label encoding / ordinal encoding (integer codes; best suited to categories with a natural order)
Scaling numerical features
Standardization (z-score)
Normalization (min-max scaling)
Feature creation
Combine or transform existing features into more useful ones
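As a rough sketch of how encoding, scaling, and feature creation might fit together, the example below uses scikit-learn's ColumnTransformer. The toy DataFrame, its column names, the size ordering S < M < L, and the derived bmi feature are assumptions made for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import (
    MinMaxScaler,
    OneHotEncoder,
    OrdinalEncoder,
    StandardScaler,
)

# Hypothetical toy data; column names are illustrative only.
df = pd.DataFrame({
    "color": ["red", "blue", "green", "blue"],   # nominal category
    "size": ["S", "M", "L", "M"],                # ordered category
    "height_cm": [150.0, 160.0, 170.0, 180.0],
    "weight_kg": [50.0, 60.0, 70.0, 80.0],
})

# Feature creation: combine two existing columns into a new one.
df["bmi"] = df["weight_kg"] / (df["height_cm"] / 100) ** 2

preprocess = ColumnTransformer(
    [
        # One-hot encoding: one binary column per category.
        ("onehot", OneHotEncoder(handle_unknown="ignore"), ["color"]),
        # Ordinal encoding with an explicit order: S -> 0, M -> 1, L -> 2.
        ("ordinal", OrdinalEncoder(categories=[["S", "M", "L"]]), ["size"]),
        # Standardization: zero mean, unit variance.
        ("standard", StandardScaler(), ["height_cm"]),
        # Normalization: rescale each column to the [0, 1] range.
        ("minmax", MinMaxScaler(), ["weight_kg", "bmi"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

# Output columns: 3 one-hot + 1 ordinal + 1 standardized + 2 min-max scaled.
X = preprocess.fit_transform(df)
print(X.shape)
```

In a real workflow the transformer would typically be fit on the training split only and then applied to validation and test data with transform, so that scaling statistics do not leak from held-out rows.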
3. Data Transformation
Log transformation to reduce right (positive) skewness in non-negative features
Binning (e.g., convert age to age groups)
Date/time feature extraction (e.g., extracting hour/day/month from timestamps)
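The three transformations above might look like this in pandas. The data, column names, and age-group boundaries are assumptions for the sake of the example; np.log1p is used instead of a plain log so that zero values would not break the transform.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names and values are illustrative only.
df = pd.DataFrame({
    "income": [20_000, 35_000, 50_000, 1_200_000],   # right-skewed
    "age": [8, 17, 35, 70],
    "ts": pd.to_datetime([
        "2024-01-15 08:30",
        "2024-03-02 17:45",
        "2024-07-21 23:10",
        "2024-12-31 00:05",
    ]),
})

# Log transformation: log1p(x) = log(1 + x), which is defined at x = 0.
df["log_income"] = np.log1p(df["income"])

# Binning: right=False gives half-open intervals [0, 13), [13, 18), [18, 65), [65, 120).
# The boundaries and labels are assumptions, not a standard.
df["age_group"] = pd.cut(
    df["age"],
    bins=[0, 13, 18, 65, 120],
    labels=["child", "teen", "adult", "senior"],
    right=False,
)

# Date/time feature extraction via the .dt accessor.
df["hour"] = df["ts"].dt.hour
df["dayofweek"] = df["ts"].dt.dayofweek   # Monday = 0, Sunday = 6
df["month"] = df["ts"].dt.month

print(df[["log_income", "age_group", "hour", "dayofweek", "month"]])
```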
4. Balancing the Dataset
SMOTE or other over-/under-sampling techniques (on training set only)
Class weight adjustment if using models that support it
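A minimal sketch of both options follows, under some stated assumptions. It uses plain random oversampling from scikit-learn rather than SMOTE (SMOTE lives in the separate imbalanced-learn package and synthesizes new points instead of duplicating existing ones), and the synthetic 90/10 dataset is invented for the example. The key point it illustrates is the ordering: split first, then resample only the training portion.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.utils import resample
from sklearn.utils.class_weight import compute_class_weight

# Synthetic, illustrative data with a 90/10 class imbalance.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = np.array([0] * 90 + [1] * 10)

# Split BEFORE resampling so the test set keeps the original class ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Option A: random oversampling of the minority class (training set only).
minority = y_train == 1
n_majority = int((~minority).sum())
X_min_up, y_min_up = resample(
    X_train[minority],
    y_train[minority],
    replace=True,            # sample with replacement
    n_samples=n_majority,    # match the majority class size
    random_state=0,
)
X_bal = np.vstack([X_train[~minority], X_min_up])
y_bal = np.concatenate([y_train[~minority], y_min_up])

# Option B: leave the data as-is and weight classes inversely to their frequency.
weights = compute_class_weight(
    class_weight="balanced", classes=np.array([0, 1]), y=y_train
)
class_weight = dict(zip([0, 1], weights))

print(np.bincount(y_bal), class_weight)
```

The resulting class_weight dictionary can be passed to estimators that accept a class_weight parameter, such as scikit-learn's LogisticRegression; many of them also accept class_weight="balanced" directly.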
5. Text Data Preprocessing (if applicable)
Lowercasing, removing punctuation, stopword removal
Tokenization
Stemming/Lemmatization
TF-IDF or word embeddings for feature extraction
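The text pipeline above can be sketched with the standard library alone. Everything here is a simplified stand-in: the stopword list is a tiny hand-picked set, stem is a crude suffix stripper (not the Porter algorithm, and not lemmatization), and tfidf uses the plain formula tf x log(N / df) without the smoothing that libraries usually add. Word embeddings are not shown.

```python
import math
import string
from collections import Counter

# Tiny illustrative stopword list; real lists are much longer.
STOPWORDS = {"the", "a", "an", "is", "are", "on", "and", "of", "to", "in"}


def stem(token):
    """Crude suffix stripping for illustration only (not a real stemmer)."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Lowercase, strip punctuation, tokenize on whitespace, drop stopwords, stem."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = text.split()
    return [stem(t) for t in tokens if t not in STOPWORDS]


def tfidf(docs):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    tokenized = [preprocess(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


docs = [
    "The cats are sleeping on the mat.",
    "A dog chased the cats!",
    "Dogs and cats playing.",
]
vectors = tfidf(docs)
for vec in vectors:
    print(vec)
```

A term that appears in every document (here the stem "cat") gets a weight of zero, which is the behavior TF-IDF is meant to provide: common terms carry little information for telling documents apart.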