DVA Assignment 1

Name: Tanya Maheshwari
Enrollment No.: 02613702022
Submitted to: Ms. Kavita Srivastva
1. Explain Analytics Process Model
Answer:
The Analytics Process Model outlines the structured approach used to derive insights from data and
support decision-making. It typically includes the following steps:
1. Problem Definition: Clearly define the business or research problem.
2. Data Collection: Gather relevant data from primary or secondary sources.
3. Data Preparation: Clean, integrate, and transform data for analysis.
4. Exploratory Data Analysis (EDA): Use statistical and visual methods to understand
patterns, trends, and anomalies.
5. Model Building: Apply statistical or machine learning models to solve the problem.
6. Validation and Testing: Evaluate model performance using test data.
7. Deployment: Implement the solution into the business environment.
8. Monitoring and Feedback: Continuously monitor model accuracy and update as needed.
This model ensures that analytics is goal-driven, repeatable, and actionable, helping organizations
gain insights and make informed decisions.
2. Describe Data Collection Process. Differentiate Between Primary and Secondary Data.
Discuss Ways of Obtaining Them.
Answer:
Data Collection is the process of gathering information to analyze and draw conclusions. It can be
done manually or automatically, and the quality of collected data directly impacts analysis.
Primary Data:
• Collected firsthand for a specific purpose.
• Methods: Surveys, interviews, experiments, focus groups, observations.
Secondary Data:
• Already collected and available for use.
• Sources: Government reports, company records, published research, online databases.
Differences:

Feature  | Primary Data       | Secondary Data
---------|--------------------|-------------------------------
Purpose  | Specific, original | Previously collected
Accuracy | Higher (custom)    | Variable (dependent on source)
Cost     | High               | Low or free

3. Explain Different Ways of Sampling. What Are the Benefits of Sampling?


Answer:
Sampling is the process of selecting a subset from a larger population to analyze and draw
conclusions.
Types of Sampling:
• Probability Sampling: Every element has a known chance of selection.
   o Simple Random Sampling
   o Stratified Sampling
   o Systematic Sampling
   o Cluster Sampling
• Non-Probability Sampling: No known probability of selection.
   o Convenience Sampling
   o Judgmental Sampling
   o Snowball Sampling
   o Quota Sampling

Benefits of Sampling:
• Cost-Effective: Reduces data collection and processing costs.
• Time-Saving: Quicker than analyzing the entire population.
• Efficient: Useful when the population is too large or inaccessible.
• Accurate: Provides reliable results when done correctly.
Sampling enables researchers to make generalized inferences without studying every data point.
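For illustration, here is a minimal sketch of simple random and stratified sampling in Python with pandas (the DataFrame and its 'region' column are hypothetical):

import pandas as pd

# Hypothetical population: 1,000 customers spread across three regions
df = pd.DataFrame({
    'customer_id': range(1000),
    'region': (['North', 'South', 'East'] * 334)[:1000],
})

# Simple random sampling: every row has an equal chance of selection
simple_sample = df.sample(n=100, random_state=42)

# Stratified sampling: draw 10% from each region so every stratum is represented
stratified_sample = df.groupby('region').sample(frac=0.10, random_state=42)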

4. Explain Different Ways of Handling Missing Values


Answer:
Handling missing values is crucial for accurate analysis and modeling. Common techniques include:
1. Deletion:
   o Listwise Deletion: Remove rows with any missing value.
   o Column Deletion: Remove features with high missingness.
   o Best for small datasets or when few values are missing.
2. Imputation:
   o Mean/Median/Mode Imputation: Replace with the average or most frequent value.
   o Forward/Backward Fill: Use the previous or next value in a sequence.
   o KNN Imputation: Use k-nearest neighbors to estimate missing values.
   o Regression Imputation: Predict missing values using other features.
3. Advanced Techniques:
   o Multiple Imputation or machine-learning models for more complex scenarios.

Proper handling of missing data avoids bias, improves model performance, and ensures data integrity.
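As a minimal illustration of these techniques in pandas (the DataFrame and its columns are hypothetical):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'age': [25, np.nan, 32, 40],
    'city': ['Delhi', 'Mumbai', None, 'Delhi'],
})

# Deletion: drop rows that contain any missing value
df_dropped = df.dropna()

# Mean imputation for a numeric column
df['age'] = df['age'].fillna(df['age'].mean())

# Mode imputation for a categorical column
df['city'] = df['city'].fillna(df['city'].mode()[0])

# Forward fill: propagate the last observed value down a sequence
df_filled = df.ffill()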
5. What Are Outliers? How Are Outliers Detected and Handled Using Python?
Answer:
Outliers are data points significantly different from other observations. They can distort statistical
analysis and model accuracy.
Detection Methods in Python:
import numpy as np
import pandas as pd
from scipy import stats

# df is an existing DataFrame with a numeric 'column'
# Using Z-score: flag points more than 3 standard deviations from the mean
z_scores = np.abs(stats.zscore(df['column']))
outliers = df[z_scores > 3]

# Using IQR: flag points beyond 1.5 * IQR outside the quartiles
Q1 = df['column'].quantile(0.25)
Q3 = df['column'].quantile(0.75)
IQR = Q3 - Q1
outliers = df[(df['column'] < (Q1 - 1.5 * IQR)) | (df['column'] > (Q3 + 1.5 * IQR))]
Handling Techniques:
• Removal: Drop rows with extreme outliers.
• Transformation: Apply log or square root to reduce skewness.
• Capping (Winsorization): Limit extreme values to certain percentiles, as sketched below.
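A brief sketch of capping, reusing the same hypothetical df and column from above:

# Winsorization: clip extreme values to the 5th and 95th percentiles
lower = df['column'].quantile(0.05)
upper = df['column'].quantile(0.95)
df['column'] = df['column'].clip(lower=lower, upper=upper)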
Outlier handling improves data quality and model robustness.
6. Differentiate Between Min/Max and Z-score Methods of Standardization
Answer:

Feature               | Min/Max Normalization                          | Z-score Standardization
----------------------|------------------------------------------------|---------------------------------------------
Formula               | x' = (x − min(x)) / (max(x) − min(x))          | x' = (x − μ) / σ
Output Range          | [0, 1] or any defined range                    | Mean = 0, SD = 1
Sensitive to Outliers | Yes                                            | Less sensitive
Usage                 | When data needs to be scaled to a specific range | When data needs to follow a normal distribution
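As a small illustration of both methods with pandas (the DataFrame and column are hypothetical):

import pandas as pd

df = pd.DataFrame({'column': [10, 20, 30, 40, 50]})
col = df['column']

# Min/Max normalization: rescale values into the [0, 1] range
df['minmax'] = (col - col.min()) / (col.max() - col.min())

# Z-score standardization: center to mean 0 with standard deviation 1
df['zscore'] = (col - col.mean()) / col.std()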

7. Explain Categorization and Segmentation


Answer:
Categorization is the process of labeling or grouping data based on predefined categories or
attributes. It simplifies analysis by classifying data into manageable groups.
Example: Classifying customers into categories such as "new," "loyal," or "inactive."
Segmentation goes further by dividing data into meaningful, often homogeneous subgroups, based
on behavior, demographics, or purchasing habits.
Example: Customer segmentation for marketing based on age, buying frequency, or location.
Differences:
• Categorization is rule-based and static.
• Segmentation is dynamic and often data-driven (e.g., using clustering algorithms like K-Means).
Both are essential for targeted marketing, personalized recommendations, and efficient decision-
making in data-driven environments.
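For illustration, a minimal customer-segmentation sketch with K-Means from scikit-learn (the feature names and values are hypothetical):

import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical customer data: age and monthly purchase frequency
customers = pd.DataFrame({
    'age': [22, 25, 31, 45, 52, 60],
    'purchase_freq': [8, 10, 4, 3, 1, 2],
})

# Segment customers into two data-driven groups
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
customers['segment'] = kmeans.fit_predict(customers[['age', 'purchase_freq']])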
