Data Science: Previous Year Question Paper (PYQ)


Q1) Attempt any Five of the following : [5 × 2 = 10]

a) What is ANOVA Test?


ANOVA (Analysis of Variance) is a statistical technique used to compare the means of three
or more groups to determine if at least one group mean is significantly different. It is
commonly used when testing differences between multiple sample groups.
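
A minimal sketch of a one-way ANOVA in base R, using made-up scores for three groups (all values are illustrative):

scores <- c(5, 7, 6, 9, 10, 11, 14, 15, 13)
group  <- factor(rep(c("A", "B", "C"), each = 3))
fit <- aov(scores ~ group)   # fit the one-way ANOVA model
summary(fit)                 # F statistic and p-value for the group effect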

b) What is Descriptive Statistics?


Descriptive statistics summarize and describe the basic features of a dataset. It includes
measures like:

 Mean (average)
 Median (middle value)
 Mode (most frequent value)
 Range, Variance, Standard Deviation
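
A short illustration of these measures in base R on made-up values (base R has no built-in mode function, so the mode is read off a frequency table):

x <- c(10, 20, 20, 30, 40)
mean(x)                      # mean (average)
median(x)                    # median (middle value)
names(which.max(table(x)))   # mode (most frequent value)
range(x)                     # minimum and maximum
var(x)                       # sample variance
sd(x)                        # sample standard deviation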

c) Define Ratio variable and Interval variable.

 Ratio Variable: A numeric variable with a meaningful zero, allowing comparison of absolute magnitudes (e.g., height, age, income).
 Interval Variable: A numeric variable with equal intervals between values but no true zero (e.g., temperature in Celsius).

d) Write any four applications of Data Science.

1. Fraud detection in banking
2. Predictive analytics in healthcare
3. Customer segmentation in marketing
4. Recommendation systems in e-commerce

e) What is Data Preprocessing?


Data preprocessing is a data mining technique that involves transforming raw data into a
clean and understandable format. It includes:

 Data cleaning
 Normalization
 Handling missing values
 Data transformation
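
A small preprocessing sketch in base R on a made-up data frame, covering missing-value handling and min-max rescaling (column names and values are illustrative):

df <- data.frame(age = c(25, NA, 35, 45),
                 salary = c(30000, 42000, 50000, 61000))
df$age[is.na(df$age)] <- mean(df$age, na.rm = TRUE)    # impute missing age with the mean
df$salary_scaled <- (df$salary - min(df$salary)) /
                    (max(df$salary) - min(df$salary))  # rescale salary to [0, 1]
print(df)
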
f) What is Exploratory Data Analysis?
EDA is an approach to analyzing data sets to summarize their main characteristics, often
using visual methods such as:

 Histograms
 Box plots
 Correlation matrices
It helps in identifying patterns, outliers, and data structures.
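
A quick EDA sketch using base R plots on the built-in mtcars dataset (the choice of dataset and columns is illustrative):

summary(mtcars$mpg)                   # five-number summary of fuel efficiency
hist(mtcars$mpg)                      # histogram
boxplot(mpg ~ cyl, data = mtcars)     # box plots grouped by number of cylinders
cor(mtcars[, c("mpg", "hp", "wt")])   # small correlation matrix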

Q2) [3 × 4 = 12]
a) Explain Data Wrangling Process.
Data wrangling, or data munging, is the process of cleaning and transforming raw data into a
usable format. The steps include:

1. Data Collection – Gathering data from multiple sources
2. Data Cleaning – Fixing or removing incorrect, corrupted, or missing data
3. Data Structuring – Converting data into the required format
4. Data Enrichment – Enhancing data by merging with other datasets
5. Validation and Storage – Ensuring accuracy and saving it for analysis
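
A tiny end-to-end wrangling sketch in base R on made-up records, touching cleaning, structuring, and enrichment (column names and values are illustrative):

raw <- data.frame(id = c(1, 2, 3, 3), sales = c(100, NA, 200, 200))
clean <- raw[!duplicated(raw), ]              # cleaning: drop duplicate rows
clean <- clean[!is.na(clean$sales), ]         # cleaning: drop rows with missing sales
regions <- data.frame(id = c(1, 2, 3), region = c("North", "South", "East"))
enriched <- merge(clean, regions, by = "id")  # enrichment: merge with another dataset
print(enriched)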

b) Briefly explain Lifecycle of Data Science.

1. Problem Definition – Understanding business requirements
2. Data Collection – Acquiring data from various sources
3. Data Preparation – Cleaning and preprocessing
4. EDA – Understanding trends and patterns
5. Model Building – Applying machine learning algorithms
6. Model Evaluation – Validating performance
7. Deployment – Integrating model into production
8. Monitoring – Ensuring continuous performance

c) Explain Central Tendencies with Examples.


Central tendency refers to the center of a data distribution.

 Mean = Average. Ex: (10+20+30)/3 = 20
 Median = Middle value. Ex: 10, 20, 30 → Median = 20
 Mode = Most frequent value. Ex: 10, 10, 20 → Mode = 10
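
The same examples checked in base R (the mode is taken from a frequency table, since base R has no mode() function):

mean(c(10, 20, 30))          # 20
median(c(10, 20, 30))        # 20
x <- c(10, 10, 20)
names(which.max(table(x)))   # "10", the most frequent value
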
Q3) [3 × 4 = 12]
a) Calculate Variance and Standard Deviation
Data: 92, 95, 85, 80, 75, 50

 Mean = (92+95+85+80+75+50)/6 = 79.5
 Squared deviations:
(92-79.5)² = 156.25
(95-79.5)² = 240.25
(85-79.5)² = 30.25
(80-79.5)² = 0.25
(75-79.5)² = 20.25
(50-79.5)² = 870.25
 Sum = 1317.5
 Variance = 1317.5 / 6 ≈ 219.58
 Standard Deviation = √219.58 ≈ 14.82
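
The same calculation can be checked in base R; note that the worked answer divides by n (population variance), whereas R's built-in var() and sd() divide by n − 1:

x <- c(92, 95, 85, 80, 75, 50)
m <- mean(x)                            # 79.5
pop_var <- sum((x - m)^2) / length(x)   # ≈ 219.58
pop_sd  <- sqrt(pop_var)                # ≈ 14.82
c(variance = pop_var, sd = pop_sd)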

b) Reasons for Preprocessing Data

1. Remove missing or inconsistent values
2. Convert data types
3. Normalize or scale features
4. Encode categorical variables
5. Improve model accuracy
6. Handle outliers and noise

c) Toolbox used by Data Scientists

 Languages: Python, R
 Libraries: Pandas, NumPy, Scikit-learn, TensorFlow
 Visualization Tools: Matplotlib, Seaborn
 IDE & Tools: Jupyter Notebook, RStudio
 Platforms: AWS, Google Cloud, GitHub

Q4) [3 × 4 = 12]
a) Data Visualization Techniques

1. Histogram – Frequency distribution
2. Bar Chart – Compare categories
3. Pie Chart – Show proportions
4. Line Graph – Trends over time
5. Box Plot – Distribution and outliers
6. Heatmap – Correlation matrix
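
One-line base R examples of the chart types listed above, using built-in datasets (the choice of datasets is illustrative):

hist(mtcars$mpg)                    # histogram
barplot(table(mtcars$cyl))          # bar chart of category counts
pie(table(mtcars$cyl))              # pie chart of proportions
plot(AirPassengers, type = "l")     # line graph of a trend over time
boxplot(mpg ~ cyl, data = mtcars)   # box plot of distribution and outliers
heatmap(cor(mtcars), symm = TRUE)   # heatmap of a correlation matrix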

b) What is Data Transformation? Rescaling Example


Data Transformation modifies data format or scale to enhance performance or
interpretability.
Rescaling (Min-Max Normalization):
x_norm = (x − x_min) / (x_max − x_min)
E.g., Value 50 in range 0–100 → (50-0)/(100-0) = 0.5
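
The same rescaling written out in base R on illustrative values:

x <- c(0, 25, 50, 75, 100)
x_norm <- (x - min(x)) / (max(x) - min(x))   # min-max normalization
x_norm                                       # 0.00 0.25 0.50 0.75 1.00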

c) Structured vs Unstructured Data

 Structured: Organized and stored in databases (e.g., Excel files, SQL tables)
 Unstructured: No predefined format (e.g., emails, videos, images)
Examples:
Structured → Employee database
Unstructured → Customer reviews on Amazon

Q5) [3 × 4 = 12]
a) Percentiles and Quartiles with Examples

 Percentile: The value below which a given percentage of observations fall (measured on a 0–100 scale)
 Quartile: Divides the ordered data into 4 equal parts (Q1, Q2, Q3)
Example for data: 10, 20, 30, 40, 50
Q1 = 20 (25th percentile), Q2 = 30 (median), Q3 = 40 (75th percentile)
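
The quartiles can be obtained with quantile() in base R; for this particular data the default interpolation gives the same Q1, Q2, Q3 as the simple positional method (other data may differ slightly depending on the quantile type):

x <- c(10, 20, 30, 40, 50)
quantile(x, probs = c(0.25, 0.50, 0.75))   # 20, 30, 40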

b) Five Steps of Hypothesis Testing

1. Define the null (H₀) and alternative (H₁) hypotheses
2. Choose significance level (α)
3. Select appropriate statistical test
4. Calculate test statistic and p-value
5. Compare p-value with α → Decide to reject or not reject H₀
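
A compact sketch of these steps with a two-sample t-test in base R on made-up samples (values and α = 0.05 are illustrative):

group_a <- c(12, 15, 14, 10, 13)
group_b <- c(18, 20, 17, 19, 21)
result <- t.test(group_a, group_b)   # Welch two-sample t-test
result$statistic                     # step 4: test statistic
result$p.value                       # step 5: reject H₀ if this is below α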

c) Common Problems with Unstructured Data

1. Lack of consistent format
2. Requires complex processing
3. High storage and processing costs
4. Harder to analyze and visualize
5. Ambiguity in interpretation (e.g., sarcasm in text)

Q6) [3 × 4 = 12]
a) Steps to Calculate p-value

1. Define hypotheses
2. Choose test (e.g., t-test)
3. Calculate test statistic
4. Use distribution to find p-value
5. Compare with α; if p < α, reject H₀
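
A sketch in base R of computing a two-sided p-value "by hand" from the t distribution, then checking it against t.test() (the sample values and H₀ mean are made up):

x <- c(52, 48, 55, 50, 49, 53)
mu0 <- 50
t_stat <- (mean(x) - mu0) / (sd(x) / sqrt(length(x)))   # test statistic
p_val  <- 2 * pt(-abs(t_stat), df = length(x) - 1)      # two-sided p-value
c(t_stat, p_val)
t.test(x, mu = mu0)$p.value                             # same p-value from the built-in test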

b) Data Cube Aggregation


A data cube is a multi-dimensional array used to store data summarized across multiple dimensions.
Example: Sales → aggregated by Time, Product, and Region.
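
A small illustration in base R of aggregating made-up sales records along the Time, Product, and Region dimensions:

sales <- data.frame(
  year    = c(2023, 2023, 2023, 2024, 2024, 2024),
  product = c("A", "B", "A", "A", "B", "B"),
  region  = c("East", "East", "West", "West", "East", "West"),
  amount  = c(100, 150, 120, 130, 160, 170)
)
aggregate(amount ~ year + product, data = sales, FUN = sum)   # one face of the cube: Time x Product
aggregate(amount ~ region, data = sales, FUN = sum)           # roll-up to a single dimension: Region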

c) R Program to Create Data Frame and Sort

# Create vectors for the employee data
emp_id <- c(101, 102, 103, 104, 105)
emp_name <- c("Alice", "Bob", "Charlie", "Diana", "Eve")
emp_salary <- c(45000, 55000, 50000, 48000, 47000)

# Combine into a data frame
employee <- data.frame(ID = emp_id, Name = emp_name, Salary = emp_salary)

# Sort by Salary (ascending) and print
sorted_emp <- employee[order(employee$Salary), ]
print(sorted_emp)

Q7) Write short notes on any two: [2 × 6 = 12]


a) Proximity Measures
Proximity measures help quantify similarity/distance between data points.

 Euclidean Distance: Straight-line distance
 Manhattan Distance: Grid-based movement
 Cosine Similarity: Angle between vectors
Used in: clustering, recommendation systems, and pattern recognition.
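
The three measures computed for two small illustrative vectors in base R:

a <- c(1, 2, 3)
b <- c(4, 5, 6)
sqrt(sum((a - b)^2))                             # Euclidean distance
sum(abs(a - b))                                  # Manhattan distance
sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))   # cosine similarity
dist(rbind(a, b), method = "manhattan")          # same Manhattan distance via dist()
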
b) Outliers
Outliers are values that deviate significantly from other observations.

 Detection: Box plot, Z-score
 Causes: Data entry errors, variability
 Impact: Can skew mean and affect model accuracy
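
A small z-score and box-plot detection sketch in base R on made-up values (the threshold of 2 standard deviations is illustrative):

x <- c(10, 12, 11, 13, 12, 95)   # 95 looks like an outlier
z <- (x - mean(x)) / sd(x)       # z-scores
x[abs(z) > 2]                    # values more than 2 SDs from the mean
boxplot(x)                       # the outlier shows up as a separate point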

c) Data Reduction
Reduces data size while preserving integrity.

 Dimensionality Reduction: PCA, LDA
 Data Compression: Removing redundancy
 Aggregation: Summarizing data
Benefits: Faster processing, reduced storage, better performance
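
A dimensionality-reduction sketch in base R: PCA on the built-in mtcars data, keeping only the first two components:

pca <- prcomp(mtcars, scale. = TRUE)   # principal component analysis on scaled data
summary(pca)                           # variance explained by each component
reduced <- pca$x[, 1:2]                # keep the first two principal components
dim(reduced)                           # 32 rows x 2 columns instead of 11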
