Key Problems in the Dataset
1. Missing Values in the "Daily Internet Usage (Minutes)" Column:
Missing values reduce the accuracy of analysis, skew averages, and limit insights into
usage trends.
2. Unrealistic Ages (2 and 150 years old):
These outliers distort demographic analyses and can lead to unreliable insights when
analyzing correlations between age and digital access behaviors.
3. Inconsistent Responses (e.g., "No Internet Access" but smartphone usage reported):
Contradictory information undermines data integrity and leads to misleading insights when
analyzing device usage patterns and access challenges.
Steps to Handle Missing Values in "Daily Internet Usage (Minutes)"
1. Evaluate the extent of missing data: Measure how much of the column is missing and
identify whether the missing values are random or systematic.
2. Choose an imputation strategy/method:
- If data is missing at random, use the median or mean for imputation; the median is the
safer choice when a few heavy users skew the distribution.
- Group-based imputation could be useful if usage correlates with variables such as age,
region, or device type.
3. Flag missing data: Maintain transparency by flagging records where imputation was
applied.
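The three steps above can be sketched in plain Python. This is a minimal illustration, not a
prescribed method: the sample records and the "region" grouping field are assumptions made
for the example, and only the usage column name comes from the dataset.

```python
from statistics import median

USAGE = "Daily Internet Usage (Minutes)"

# Hypothetical sample records; "region" is an assumed grouping field.
records = [
    {"region": "North", USAGE: 120},
    {"region": "North", USAGE: None},
    {"region": "North", USAGE: 180},
    {"region": "South", USAGE: 60},
    {"region": "South", USAGE: 40},
    {"region": "South", USAGE: None},
]

# Step 1: measure how much of the column is missing.
missing_share = sum(r[USAGE] is None for r in records) / len(records)

# Step 2: compute a median per group from the observed values only,
# plus an overall median as a fallback for groups with no observed values.
observed = {}
for r in records:
    if r[USAGE] is not None:
        observed.setdefault(r["region"], []).append(r[USAGE])
group_median = {region: median(vals) for region, vals in observed.items()}
overall_median = median(v for vals in observed.values() for v in vals)

# Step 3: flag each imputed record, then fill in the value.
for r in records:
    r["usage_imputed"] = r[USAGE] is None
    if r["usage_imputed"]:
        r[USAGE] = group_median.get(r["region"], overall_median)
```

Setting the flag before filling the value keeps a record of which entries were imputed, so
later analyses can be run with and without them.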
Handling Duplicate Records
1. Check for exact duplicates: Identify identical records by comparing all fields.
2. Verify key attributes: Ensure the respondent IDs, survey dates, and demographic details
match perfectly.
3. Resolve discrepancies: Keep only the most recent or complete entry where duplicates
have conflicting data.
4. Remove identical entries: If records are fully redundant, drop the duplicates.
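As a rough sketch of these steps, the example below first drops exact duplicates and then
resolves conflicting entries for the same respondent. The field names and sample records are
hypothetical, and ranking by completeness first and survey date second is one possible
reading of "most recent or complete", not the only one.

```python
# Hypothetical records; all field names are assumptions for illustration.
records = [
    {"respondent_id": 1, "survey_date": "2024-01-05", "age": 34, "device": "Smartphone"},
    {"respondent_id": 1, "survey_date": "2024-01-05", "age": 34, "device": "Smartphone"},
    {"respondent_id": 2, "survey_date": "2024-01-03", "age": 51, "device": None},
    {"respondent_id": 2, "survey_date": "2024-01-09", "age": 51, "device": "Laptop"},
    {"respondent_id": 3, "survey_date": "2024-01-07", "age": 27, "device": "Tablet"},
]

# Steps 1 and 4: drop records that match on every field, preserving order.
seen = set()
unique = []
for r in records:
    key = tuple(sorted(r.items()))
    if key not in seen:
        seen.add(key)
        unique.append(r)


def completeness(record):
    """Count the fields that actually hold a value."""
    return sum(value is not None for value in record.values())


# Steps 2 and 3: among records sharing a respondent ID, keep the most
# complete one, breaking ties by the latest survey date.
best = {}
for r in unique:
    rank = (completeness(r), r["survey_date"])  # ISO dates sort correctly as text
    rid = r["respondent_id"]
    if rid not in best or rank > best[rid][0]:
        best[rid] = (rank, r)

deduplicated = [r for _, r in best.values()]
```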
Dealing with Unrealistic Ages (Outliers)
1. Set logical bounds: Apply an age range filter (e.g., 10 to 100 years).
2. Flag or remove outliers: Flag records that fall outside the defined range for review, or
exclude them from age-based analyses.
3. Consider contextual validation: Validate questionable ages if supporting data (e.g.,
parental guidance) justifies exceptions.
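A small sketch of the bounds check, assuming the example range of 10 to 100 years given
above. The sample records and the "age_out_of_range" flag name are hypothetical.

```python
MIN_AGE, MAX_AGE = 10, 100  # example bounds; adjust to the survey's target population

# Hypothetical records, including the two unrealistic ages from the dataset.
records = [
    {"respondent_id": 1, "age": 34},
    {"respondent_id": 2, "age": 2},
    {"respondent_id": 3, "age": 150},
    {"respondent_id": 4, "age": 67},
    {"respondent_id": 5, "age": None},
]

# Flag rather than delete, so questionable ages can still be reviewed.
# A missing age is not treated as out of range here.
for r in records:
    age = r["age"]
    r["age_out_of_range"] = age is not None and not (MIN_AGE <= age <= MAX_AGE)

flagged = [r for r in records if r["age_out_of_range"]]
valid = [r for r in records if not r["age_out_of_range"]]
```

Keeping the flagged records in a separate list leaves room for the contextual validation in
step 3 before anything is dropped for good.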
Why Inconsistent Responses Are Problematic
- Data Integrity Issues: Contradictory data undermines trust in results and affects
downstream analyses.
- Misleading Insights: It becomes challenging to infer meaningful patterns when core
information is unreliable.
- Analysis Barriers: Filters and aggregations yield flawed outcomes due to contradictory
responses.
- Resolution: Re-verify ambiguous responses and, if necessary, exclude inconsistent records
from critical analyses.
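One way to find such records is to write each contradiction as an explicit rule and flag the
matches for re-verification. The sketch below encodes only the example from the dataset ("No
Internet Access" with smartphone usage reported); the field names and sample records are
assumptions.

```python
# Hypothetical records; "internet_access" and "devices" are assumed field names.
records = [
    {"respondent_id": 1, "internet_access": "Home Broadband", "devices": ["Laptop"]},
    {"respondent_id": 2, "internet_access": "No Internet Access", "devices": ["Smartphone"]},
    {"respondent_id": 3, "internet_access": "No Internet Access", "devices": []},
]


def is_inconsistent(record):
    """Return True when a record reports no internet access but smartphone usage."""
    return (
        record["internet_access"] == "No Internet Access"
        and "Smartphone" in record["devices"]
    )


to_reverify = [r for r in records if is_inconsistent(r)]
consistent = [r for r in records if not is_inconsistent(r)]
```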
Standardizing Inconsistent Date Formats
1. Identify patterns: Use tools to detect the different formats present (e.g., DD/MM/YYYY,
MM-DD-YYYY).
2. Standardize formats: Convert all dates to a uniform format (e.g., ISO 8601:
YYYY-MM-DD).
3. Automate validation: Implement data cleaning scripts to catch format errors.
4. Validate changes: Cross-check converted dates for accuracy.
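The steps above might look like the following sketch, which uses only Python's standard
datetime module. The list of known formats is an assumption based on the examples named in
step 1, and the helper name to_iso is hypothetical.

```python
from datetime import datetime

# Formats assumed to occur in the data; extend after inspecting the column.
KNOWN_FORMATS = ["%Y-%m-%d", "%d/%m/%Y", "%m-%d-%Y"]


def to_iso(raw):
    """Return the date as YYYY-MM-DD, or None if no known format matches."""
    text = raw.strip()
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue  # wrong format or impossible date; try the next format
    return None


raw_dates = ["05/03/2024", "03-05-2024", "2024-03-05", "31/02/2024", "March 5"]
converted = [to_iso(d) for d in raw_dates]

# Values that could not be converted are collected for manual review.
needs_review = [d for d, iso in zip(raw_dates, converted) if iso is None]
```

Here the separator tells the day-first and month-first formats apart. If both conventions
appear with the same separator, a value such as 03/05/2024 cannot be resolved automatically
and has to be checked against its source.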
Impact of inconsistent dates:
- Inability to sort or analyze trends chronologically.
- Errors in calculating time-based metrics like engagement periods or survey timelines.
Section B: Multiple-Choice Questions
1. C - Increased sample size
2. B - Data cleaning and preprocessing
3. B - Replace missing values with the column average (mean).
Section C: True or False Questions
1. False — Outliers should be assessed case by case.
2. False — Duplicates should be checked for validity before removal.
3. True — Dropping a column is a good approach if it has too many missing values.
4. True — Typographical errors can lead to incorrect analysis.
5. True — Data cleaning is an essential part of data analysis.
Section D: Fill in the Gaps
1. The process of identifying and fixing errors in data is called data cleaning.
2. If a dataset has missing values, we can either remove or impute them.
3. A duplicate record means the same observation appears multiple times in the dataset.
4. Data should be accurate, meaning free of errors and inconsistencies.
5. The presence of extreme values in a dataset is called outliers.