Data Preprocessing

The document outlines essential steps for data cleaning, feature engineering, and transformation, including handling missing values, encoding categorical variables, and scaling numerical features. It also discusses techniques for balancing datasets and preprocessing text data. Key methods include imputation, standardization, and the use of SMOTE for class balancing.

Uploaded by

och66666

1. Data Cleaning

Handle missing values

Remove rows/columns with too many missing values

Fill (impute) missing values (mean, median, mode, interpolation, etc.)

Fix data types (e.g., convert strings to dates, integers to floats)

Remove or correct outliers

Use z-score, IQR method, or visual tools like boxplots

Standardize categorical values (e.g., fix typos in category labels)

Drop irrelevant features that won’t contribute to model performance
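Two of the cleaning steps above, median imputation and the IQR outlier rule, can be sketched with the Python standard library alone. This is a minimal illustration, not a prescribed implementation: the `ages` column, the helper names, and the conventional 1.5 multiplier are assumptions made for the example. In practice a library such as pandas would normally handle this.

```python
import statistics

def impute_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    med = statistics.median(observed)
    return [med if v is None else v for v in values]

def iqr_bounds(values, k=1.5):
    """Return the (lower, upper) fences of the IQR rule."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartiles (Python 3.8+)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical column with two missing entries and one implausible value.
ages = [23, 25, None, 31, 29, 27, 120, None, 26]

filled = impute_median(ages)      # the median is robust to the value 120
low, high = iqr_bounds(filled)
cleaned = [v for v in filled if low <= v <= high]

print(filled)
print(cleaned)
```

The median is used for imputation here because, unlike the mean, it is barely affected by the outlier that has not yet been removed.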

2. Feature Engineering

Encoding categorical variables

One-hot encoding

Label encoding / ordinal encoding

Scaling numerical features

Standardization (z-score)

Normalization (min-max scaling)

Feature creation

Combine or transform existing features into more useful ones
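The encoding and scaling options above can be sketched in plain Python as follows. The sample `colors` and `incomes` columns and the function names are hypothetical, and the helpers assume a non-constant numeric column (otherwise the divisions would fail). Libraries such as scikit-learn provide production versions of these transformations.

```python
import statistics

def one_hot(labels):
    """Map each label to a 0/1 vector with one position per category."""
    categories = sorted(set(labels))
    rows = [[1 if lab == c else 0 for c in categories] for lab in labels]
    return categories, rows

def standardize(values):
    """Z-score scaling: subtract the mean, divide by the standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population std; assumes std > 0
    return [(v - mean) / std for v in values]

def min_max(values):
    """Min-max normalization to the range [0, 1]; assumes max > min."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical columns.
colors = ["red", "green", "blue", "green"]
incomes = [30_000, 45_000, 60_000, 75_000]

categories, encoded = one_hot(colors)
z = standardize(incomes)
scaled = min_max(incomes)

print(categories, encoded)
print(z)
print(scaled)
```

One-hot encoding suits categories with no natural order; label or ordinal encoding is the usual choice when an order exists (for example small < medium < large).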

3. Data Transformation

Log transformation to reduce skewness

Binning (e.g., convert age to age groups)

Date/time feature extraction (e.g., extracting hour/day/month from timestamps)
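A small sketch of the three transformations above, using only the standard library. The bin edges, group names, and sample values are assumptions chosen for illustration. `math.log1p` is used instead of a plain logarithm so that zero values remain valid.

```python
import math
from datetime import datetime

def log_transform(values):
    """Compress a right-skewed column with log(1 + x)."""
    return [math.log1p(v) for v in values]

def age_group(age):
    """Bin a numeric age into a coarse category (edges are illustrative)."""
    if age < 18:
        return "child"
    if age < 65:
        return "adult"
    return "senior"

def time_features(timestamp):
    """Extract calendar features from an ISO 8601 timestamp string."""
    dt = datetime.fromisoformat(timestamp)
    return {
        "hour": dt.hour,
        "day": dt.day,
        "month": dt.month,
        "weekday": dt.weekday(),  # Monday is 0
    }

# Hypothetical inputs.
prices = [0, 9, 99, 999]
logged = log_transform(prices)
groups = [age_group(a) for a in [7, 34, 70]]
features = time_features("2024-03-15T14:30:00")

print(logged)
print(groups)
print(features)
```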

4. Balancing the Dataset

SMOTE or other over-/under-sampling techniques (on training set only)

Class weight adjustment if using models that support it
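The sketch below illustrates both ideas on a toy training set. Note that it uses plain random oversampling (duplicating minority rows), not SMOTE, which instead synthesizes new points by interpolating between neighbouring minority samples. The weight formula n_samples / (n_classes * class_count) is the common "balanced" heuristic. All names and data here are hypothetical.

```python
import random
from collections import Counter

def class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * cnt) for cls, cnt in counts.items()}

def random_oversample(X, y, seed=0):
    """Duplicate random minority rows until every class matches the largest."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for cls, cnt in counts.items():
        pool = [x for x, lab in zip(X, y) if lab == cls]
        for _ in range(target - cnt):
            X_out.append(rng.choice(pool))
            y_out.append(cls)
    return X_out, y_out

# Hypothetical TRAINING split only; the test split is left untouched so that
# evaluation reflects the real class distribution.
X_train = [[0.1], [0.2], [0.3], [0.4], [0.9], [1.0]]
y_train = [0, 0, 0, 0, 1, 1]

weights = class_weights(y_train)
X_bal, y_bal = random_oversample(X_train, y_train)

print(weights)
print(Counter(y_bal))
```

Resampling is applied after the train/test split because resampled copies that leak into the test set would inflate the measured performance.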

5. Text Data Preprocessing (if applicable)

Lowercasing, removing punctuation, stopword removal

Tokenization

Stemming/Lemmatization

TF-IDF or word embeddings for feature extraction
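As a rough sketch of this pipeline, the code below lowercases, strips punctuation, tokenizes on whitespace, removes stopwords, and computes an unsmoothed textbook TF-IDF. The tiny stopword set and sample documents are assumptions for the example, stemming and lemmatization are omitted, and library implementations typically use smoothed variants of the formula, so their numbers will differ.

```python
import math
import string
from collections import Counter

# Tiny illustrative stopword list; real lists are much longer.
STOPWORDS = {"the", "a", "is", "on", "and"}

def tokenize(text):
    """Lowercase, strip ASCII punctuation, split on whitespace, drop stopwords."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return [tok for tok in text.split() if tok not in STOPWORDS]

def tf_idf(docs):
    """Return one {term: weight} dict per document.

    weight = (term count / document length) * log(n_docs / document frequency)
    """
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append(
            {t: (c / len(tokens)) * math.log(n_docs / df[t]) for t, c in tf.items()}
        )
    return vectors

# Hypothetical corpus.
docs = ["The cat sat on the mat.", "The dog sat on the log!", "Cats and dogs."]
vectors = tf_idf(docs)

print(tokenize(docs[0]))
print(vectors[0])
```

Terms that appear in fewer documents receive a higher weight, which is why "cat" outweighs "sat" in the first document. Without stemming or lemmatization, "cat" and "cats" are treated as unrelated terms.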
