Ma’am Shabana Kousar Computer Science Unit 04
Unit 04
Data And Analysis
EXERCISE
Multiple Choice Questions (MCQs)
MCQ 1 2 3 4 5 6 7 8 9 10
Answer A C B B B C A A D D
Short Questions
Q1. List out the parameters and statistics from given statements.
a) Average height of a giraffe.
b) Average weight of a watermelon.
c) There are 430 doctors in a hospital.
d) Average age of students of 6th class in a school is 12 years.
Parameters and Statistics
a) Average height of a giraffe.
Parameter: The average height of all giraffes (it refers to the whole population of giraffes).
Statistic: The average height of a sample of giraffes (if only some giraffes were measured).
b) Average weight of a watermelon.
Parameter: The average weight of all watermelons (if considering the entire population of watermelons).
Statistic: The average weight of a sample of watermelons (if it refers to a sample drawn from the population).
c) There are 430 doctors in a hospital.
Parameter: The total number of doctors in the hospital (430). This is a fixed value that describes the whole population of doctors in that hospital.
Statistic: None in this statement, because every doctor is counted. A statistic would arise only if the number were estimated from a sample, such as a few departments.
d) Average age of students of 6th class in a school is 12 years.
Parameter: The average age of all students in the 6th class (if 12 years was calculated from the entire population of 6th-grade students in the school).
Statistic: The average age of a sample of students from the 6th class of the specified school (if 12 years was calculated from only some of the students).
In summary:
Parameters refer to the whole population, while statistics refer to a sample from that
population.
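The difference can be illustrated with a short sketch. The ages below are made-up values for a hypothetical class of 40 students, not data from the exercise. The mean of the whole list plays the role of the parameter, and the mean of a random sample of 10 students plays the role of the statistic.

```python
import random
import statistics

# Hypothetical population: ages of all 40 students in one class (made-up values)
random.seed(4)  # fixed seed so the sketch gives the same result on every run
population = [random.choice([11, 12, 13]) for _ in range(40)]

# Parameter: calculated from every member of the population
parameter_mean = statistics.mean(population)

# Statistic: calculated from a sample drawn from that population
sample = random.sample(population, 10)
statistic_mean = statistics.mean(sample)

print("Parameter (population mean):", parameter_mean)
print("Statistic (sample mean):", statistic_mean)
```

The two values are usually close but not identical, which is why a statistic is treated as an estimate of the parameter.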
Q2. If you want to make a report regarding the products exported from Pakistan in the last five
years, how can libraries help you to collect data? Write steps.
Using Libraries for Data Collection
Use online libraries to access specialized trade databases like Trade Map for export data.
Use library catalogs to find books, reports, or research papers on Pakistan’s export
history over the last five years.
Seek help from librarians to locate specific trade-related sources and historical data archives.
Q3. Make a pie chart of vegetable prices in the market. Consider five to ten vegetables.
import matplotlib.pyplot as plt
# Sample data for five vegetables and their prices (illustrative values)
vegetables = ['Potatoes','Tomatoes','Carrots','Onions','Cabbage']
prices = [50, 30, 20, 10, 40]
# Create a pie chart
plt.figure(figsize=(7, 7))
plt.pie(prices, labels=vegetables, autopct='%1.1f%%')
plt.title('Vegetable Prices in the Market')
plt.axis('equal') # Equal aspect ratio ensures that pie chart is drawn as a circle.
# Display the chart
plt.show()
Q4. Enlist steps to represent monthly temperatures of a Pakistani city in 2023 from January till
December using a line graph.
Steps:
1. Get monthly temperatures of a Pakistani city in 2023 from January to December.
2. Write this Python code to create a line graph.
import numpy as np
import matplotlib.pyplot as plt
x = np.array(['January', 'February', 'March', 'April', 'May', 'June', 'July',
              'August', 'September', 'October', 'November', 'December'])
y = np.array([15, 19, 22, 25, 28, 33, 36, 27, 24, 21, 17, 10])
plt.plot(x, y, 'o-b', label='Temperatures')
plt.xlabel('Month')
plt.ylabel('Temperature')
plt.legend()
plt.xticks(rotation=45)
plt.show()
3. Save the file as “temp.py”.
4. Run the program. The line chart will appear.
Long Questions
Q2. Sketch primary data collection methods in the context of disease outbreak like seasonal flu.
Primary Data Collection Methods for Seasonal Flu
The primary data collection methods in the context of a disease outbreak like seasonal flu are the
following.
i. Surveys and Questionnaires
The surveys can be distributed to individuals in affected areas to collect data about flu
symptoms, vaccination history and healthcare utilization.
ii. Interviews
The interviews can be conducted with patients and health workers to understand flu symptoms,
timeline and contact with others.
iii. Observations
The observation method is used to monitor and record real-time data. You can observe and
document the number of patients who come to clinics with flu symptoms.
iv. Medical Record Review
This method is used to analyze patient records for trends and patterns about the flu. The
hospital record can be reviewed to identify the number of flu cases, severity and outcomes.
v. Field Reports
Field reports can be collected from public health workers about the spread of flu, vaccination
rates and community response.
vi. Focus Groups
The qualitative data can be collected from a group of individuals. Focus groups can be held with
community members to discuss their concerns and experiences with the flu.
Q3. Argue about the use of statistical modeling techniques. Highlight all techniques discussed in
this unit.
Statistical Modeling Techniques
Statistical modeling techniques are used for analyzing data, finding patterns in it and making
predictions. There are two categories of statistical modeling methods in data analysis.
Supervised learning
Unsupervised learning
1. Supervised Learning
In a supervised learning model, the algorithm learns from a labeled dataset. The labels act as an answer
key, which the algorithm uses to check the accuracy of its predictions.
Supervised learning techniques in statistical modeling include the regression model and the
classification model.
Regression Model
If the result, label or outcome of a model is a continuous value, then it is called regression. The most
common regression model is the linear model. A linear regression model is a mathematical equation that
allows us to predict a response for a given predictor value.
Classification Model
If the result, label or outcome of a model is a discrete value, then it is called classification. For
example, if we predict whether a certain employee will get a raise in salary or not, then it is classification.
2. Unsupervised Learning
In unsupervised learning model, the algorithm is given unlabeled data and attempts to extract features
and determine patterns independently. There are two methods of unsupervised learning.
Clustering algorithm
Association rules
Clustering
Clustering means grouping data items that have maximum similarities. For example, to reduce the churn
rate of customers, a telecom company analyzes the usage of its customers and divides them into three clusters:
Customers with long call duration.
Customers with heavy internet usage.
Customers with short calls and average Internet usage.
Now the company provides attractive packages to all these kinds of users to retain them.
Association
Association means how likely a second action is if a first action has been done. For example, in a
supermarket:
Customer 1 buys bread, milk, tea and tissues.
Customer 2 buys bread, milk, coffee and eggs.
Now if customer 3 buys bread, then there are more chances that he will also buy milk. There is an
association between milk and bread.
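The strength of such an association is commonly measured by confidence: the share of baskets containing the first item that also contain the second. The following is a minimal sketch using only the two baskets from the example above. The function name `confidence` is chosen here for illustration.

```python
# Baskets of the two customers from the supermarket example
baskets = [
    {"bread", "milk", "tea", "tissues"},   # Customer 1
    {"bread", "milk", "coffee", "eggs"},   # Customer 2
]

def confidence(baskets, first, second):
    """Share of baskets containing `first` that also contain `second`."""
    with_first = [b for b in baskets if first in b]
    if not with_first:
        return 0.0  # the first item was never bought, so nothing can be said
    with_both = [b for b in with_first if second in b]
    return len(with_both) / len(with_first)

print(confidence(baskets, "bread", "milk"))  # 1.0: both bread buyers also bought milk
print(confidence(baskets, "bread", "tea"))   # 0.5: only one bread buyer bought tea
```

With only two baskets the numbers are not reliable. A real analysis would use many transactions.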
K-means Clustering
This algorithm divides the data points into a specified number (k) of groups based on
similarities. Each data point is placed in the group whose center (mean) is nearest to it.
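The idea can be sketched in plain Python for one-dimensional data. This is a simplified illustration, not a library implementation: the call minutes and the starting centers are made-up values, and the function name `k_means_1d` is hypothetical.

```python
def k_means_1d(values, centers, rounds=10):
    """Group numbers around `centers` by repeating assignment and update steps."""
    centers = list(centers)
    clusters = [[] for _ in centers]
    for _ in range(rounds):
        # Assignment step: put each value in the cluster of its nearest center
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Update step: move each center to the mean of its cluster
        new_centers = [
            sum(c) / len(c) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
        if new_centers == centers:
            break  # centers stopped moving, so the grouping is stable
        centers = new_centers
    return centers, clusters

# Hypothetical daily call minutes of nine customers (made-up values), k = 3
minutes = [2, 3, 4, 20, 22, 25, 58, 60, 65]
centers, clusters = k_means_1d(minutes, centers=[0, 30, 70])
print(clusters)  # [[2, 3, 4], [20, 22, 25], [58, 60, 65]]
```

The result depends on the starting centers, so in practice the algorithm is often run several times with different starting points.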
Q4. Compare linear regression and classification. Emphasize on their respective roles in statistical
modeling.
Linear Regression Model
The most common regression model is the linear model. A linear regression model is a
mathematical equation that allows us to predict a response for a given predictor value.
The variable which is used for prediction is called the independent variable (x) and the variable to be
predicted based on the value of x is called the dependent variable (y).
In simple linear regression the equation of the line is y = mx + b, where m is the slope and b
is the intercept.
The intercept is the point where the graph line meets the y-axis. It is also called the y-intercept.
The slope is calculated as rise/run, where rise is the vertical change between two points on the line and
run is the horizontal change between the same two points.
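The calculation of slope and intercept can be sketched as follows. The first function finds the line through two points using rise/run, and the second fits a line to several points using the standard least-squares formulas. All function names and data points are hypothetical and chosen only for illustration.

```python
def line_through(p1, p2):
    """Slope m and intercept b of the line through two points."""
    (x1, y1), (x2, y2) = p1, p2
    rise = y2 - y1        # vertical change between the two points
    run = x2 - x1         # horizontal change (must not be zero)
    m = rise / run
    b = y1 - m * x1       # value of y where the line meets the y-axis (x = 0)
    return m, b

def fit_line(xs, ys):
    """Least-squares slope and intercept for several (x, y) points."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    m = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    b = mean_y - m * mean_x
    return m, b

def predict(x, m, b):
    """Predict the dependent variable y for a given independent variable x."""
    return m * x + b

print(line_through((1, 5), (3, 9)))           # (2.0, 3.0)
m, b = fit_line([1, 2, 3, 4], [5, 7, 9, 11])  # points lying exactly on y = 2x + 3
print(m, b)                                   # 2.0 3.0
print(predict(10, m, b))                      # 23.0
```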
Classification Model
If the result, label or outcome of a model is a discrete value, then it is called classification.
For example, if we predict whether a certain employee will get an increase in salary or not, then it is
classification. But if we predict how much salary increase an employee will get
based on some statistical data, then it is a regression model.
Linear regression and classification are fundamental techniques in statistical modeling, but they address
different types of problems and have distinct methodologies and roles. Here’s a detailed comparison:
1. Nature of the Problem
Aspect | Linear Regression | Classification
Type of Problem | Regression: Predicts continuous numeric outcomes. | Classification: Predicts discrete categorical outcomes.
Examples | Predicting house prices, temperatures, or stock values. | Predicting whether an email is spam or not, or classifying animals.
2. Key Objective
Aspect | Linear Regression | Classification
Goal | Estimate relationships between variables and predict numeric values. | Assign input data to one or more predefined categories.
Output | Continuous numeric values (e.g., 10.5, 20.3). | Discrete class labels (e.g., "spam" or "not spam").
3. Methodology
Aspect | Linear Regression | Classification
Mathematics | Fits a linear equation: y = β0 + β1x1 + … + βnxn | Uses algorithms like logistic regression, decision trees, or SVMs.
Error Metric | Measures residuals using Mean Squared Error (MSE) or R-squared. | Evaluates accuracy, precision etc.
Linearity | Assumes a linear relationship between predictors and response. | Does not assume linearity; works well with non-linear boundaries.
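The error metrics named in the table can be computed by hand as a short sketch. The formulas are the standard ones for MSE, R-squared, accuracy and precision. The actual and predicted values below are made-up numbers used only to show the calculation.

```python
def mean_squared_error(actual, predicted):
    """Average of the squared residuals (actual minus predicted)."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def r_squared(actual, predicted):
    """1 minus (residual sum of squares / total sum of squares)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

def accuracy(actual, predicted):
    """Share of predictions that match the true label."""
    return sum(1 for a, p in zip(actual, predicted) if a == p) / len(actual)

def precision(actual, predicted, positive):
    """Share of items predicted as `positive` that really are `positive`."""
    truths = [a for a, p in zip(actual, predicted) if p == positive]
    if not truths:
        return 0.0
    return sum(1 for a in truths if a == positive) / len(truths)

# Regression: hypothetical continuous values
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 9.0]
print("MSE:", mean_squared_error(y_true, y_pred))
print("R-squared:", round(r_squared(y_true, y_pred), 3))

# Classification: hypothetical class labels
labels_true = ["spam", "spam", "not spam", "not spam", "spam"]
labels_pred = ["spam", "spam", "spam", "not spam", "not spam"]
print("Accuracy:", accuracy(labels_true, labels_pred))
print("Precision:", round(precision(labels_true, labels_pred, "spam"), 3))
```

The regression metrics measure how far the predicted numbers are from the true numbers, while the classification metrics count how many labels are right or wrong.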
4. Model Outputs
Aspect | Linear Regression | Classification
Model Equation | Produces a continuous output using a fitted line. | Produces probabilities (e.g., logistic regression) or class labels.
Interpretability | Coefficients show the relationship strength and direction. | Outputs class probabilities or decision rules.
5. Roles in Statistical Modeling
Linear Regression:
Role: Explores relationships between variables, trends, and predictions where the output is numeric.
Applications:
o Predicting future sales or growth rates.
o Analyzing the impact of independent variables on a continuous dependent variable.
o Building foundational models for econometrics or forecasting.
Classification:
Role: Focuses on decision-making and prediction in categorical contexts.
Applications:
o Spam detection, fraud detection, and sentiment analysis.
o Medical diagnostics (e.g., classifying diseases based on symptoms).
o Image recognition and natural language processing.
6. Example Scenario Comparison
Linear Regression: If a company wants to predict the sales revenue based on advertising spend, linear
regression provides a numerical estimate of revenue.
Classification: If a company wants to determine whether a customer will make a purchase (Yes/No),
classification provides categorical predictions.
7. Strengths and Weaknesses
Aspect | Linear Regression | Classification
Strengths | Simple, interpretable, effective for continuous data. | Versatile, handles complex decision boundaries, applicable to categorical outcomes.
Weaknesses | Poor performance with non-linear relationships or categorical outputs. | Requires careful handling of imbalanced datasets or multi-class problems.
Conclusion
Linear regression and classification are complementary tools in statistical modeling, serving distinct
purposes. While linear regression excels at understanding and predicting numeric relationships,
classification is indispensable for decision-making and categorical predictions. Together, they address a
broad spectrum of analytical challenges across domains.
Q5. Defend either supervised learning or unsupervised learning. Give reasons for your
preference over the other.
Supervised Learning vs Unsupervised Learning
I prefer supervised learning over unsupervised learning because it’s more straightforward and accurate
for many tasks. Here’s why:
Better Predictions
Supervised learning uses labeled data, meaning the model knows the correct answers during
training. This makes it much better at making accurate predictions.
Clear Goals
In supervised learning, the goal is simple: predict the right answer. Since we know what the outcome
should be, it’s easier to measure how well the model is performing. Unsupervised learning doesn’t
have clear goals, so it’s harder to evaluate its success.
Easy to Understand
Supervised learning models are often easier to interpret, especially simpler ones like decision trees.
This makes it easier for people to trust and understand the decisions, which is important in fields like
healthcare or finance.
Real-World Use
Many practical applications, like customer churn prediction, product recommendations, or image
classification, are best handled by supervised learning because it provides more reliable and usable
results.
Why I Prefer Supervised Over Unsupervised Learning?
While unsupervised learning is useful for exploring patterns and finding clusters in data, it’s harder
to interpret and act on its results. Supervised learning is more direct, producing clear, accurate
predictions that are easier to use in real-world situations.
Q6. Write a Python code to generate a dataset with variables where y = x^2 + 2x. Fit scatter plot
and box plot on this data.
Scatter Plot For y = x^2 + 2x
import numpy as np
import matplotlib.pyplot as plt # Plotting library
# Generate data
x = np.linspace(-15, 30, 50) # Generate 50 points between -15 and 30
y = (x * x) + (2 * x) # Calculate y = x^2 + 2x for each value of x
# Create the plot
plt.figure(figsize=(10, 6)) # Define figure size
plt.scatter(x, y, color='red', label='y = x^2 + 2x') # Create scatter plot
plt.title('Scatter Plot') # Add a title
plt.xlabel('x') # Label the x-axis
plt.ylabel('y') # Label the y-axis
plt.grid(True) # Add grid lines for better visualization
plt.legend() # Display the legend
plt.show() # Render the plot
Box Plot For y = x^2 + 2x
import numpy as np
import matplotlib.pyplot as plt # Plotting library
# Generate data
x = np.linspace(-15, 30, 100)
y = (x * x) + (2 * x)
# Create a box plot
plt.figure(figsize=(10, 6)) # Define figure size
plt.boxplot(y, vert=False) # Create a horizontal box plot
plt.title('Box Plot') # Add a title
plt.xlabel('y') # Label the x-axis (the box plot is horizontal, so y values run along it)
plt.show() # Render the plot
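The numbers behind a box plot can also be computed directly. The sketch below rebuilds the same kind of data without NumPy and calculates the quartiles, the interquartile range (IQR) and the whisker limits under the common 1.5 × IQR rule, which is also Matplotlib's default whisker setting. The quartile values may differ slightly from the ones Matplotlib draws, depending on the interpolation method.

```python
import statistics

# 100 evenly spaced x values from -15 to 30, and y = x^2 + 2x
x = [-15 + 45 * i / 99 for i in range(100)]
y = [xi * xi + 2 * xi for xi in x]

# Quartiles: Q1, median and Q3 split the sorted data into four equal parts
q1, median, q3 = statistics.quantiles(y, n=4, method="inclusive")
iqr = q3 - q1  # interquartile range, the length of the box

# Whisker limits under the 1.5 * IQR rule; values outside are drawn as outliers
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr
outliers = [v for v in y if v < lower_limit or v > upper_limit]

print("Q1:", round(q1, 2), "Median:", round(median, 2), "Q3:", round(q3, 2))
print("Values beyond the whisker limits:", len(outliers))
```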
Q7. Relate some real-world examples (other than Airbnb, Facebook and YouTube) where data
science was used to improve marketing strategies and enhance the business.
Real-World Examples of Data Science
Here are some examples of companies using data science to improve their marketing strategies and
enhance the business.
1. Coca-Cola: Personalizing Customer Engagement
Challenge: Maintaining customer loyalty and increasing sales in a competitive beverage market.
Data Science Application:
o Coca-Cola uses data from its vending machines and social media platforms to gather customer
preferences.
o By analyzing purchase patterns, they create personalized marketing campaigns and offer targeted
promotions.
o Example: Location-based promotions on mobile apps to attract customers to nearby vending
machines.
Impact:
o Improved customer retention and engagement.
o Increased sales through personalized marketing.
2. Netflix: Content Recommendation for Retention
Challenge: Keeping users engaged to reduce churn and drive subscriptions.
Data Science Application:
o Netflix uses collaborative filtering and deep learning to analyze viewer behavior, preferences,
and watch history.
o This data powers its recommendation engine to suggest content tailored to individual users.
Impact:
o Enhanced user satisfaction with personalized recommendations.
o Boosted viewer engagement and extended subscription periods.
3. Starbucks: Optimizing Store Locations and Promotions
Challenge: Identifying profitable locations and increasing foot traffic.
Data Science Application:
o Starbucks uses location-based data and demographic information to determine optimal store
locations.
o Machine learning models predict customer behavior and identify the best areas for targeted
promotions.
o Example: Offering location-specific discounts via the Starbucks Rewards app.
Impact:
o Efficient resource allocation for new store openings.
o Increased customer visits and revenue through targeted marketing.
4. Amazon: Dynamic Pricing
Challenge: Maximizing revenue while staying competitive in a fast-changing e-commerce environment.
Data Science Application:
o Amazon uses machine learning algorithms to analyze competitor pricing, demand trends, and
customer purchase history.
o Prices are adjusted dynamically to optimize sales and profitability.
Impact:
o Improved profit margins and competitive advantage.
o Higher customer satisfaction due to competitive pricing.
5. Zara: Inventory Management and Trend Prediction
Challenge: Minimizing unsold inventory while staying ahead of fashion trends.
Data Science Application:
o Zara uses data from customer purchases, social media trends, and store visits to predict demand
for new styles.
o Machine learning models help optimize inventory levels and supply chain decisions.
Impact:
o Reduced inventory waste and costs.
o Faster turnaround in responding to fashion trends.
6. Spotify: Improving User Engagement
Challenge: Retaining users by providing a highly personalized music experience.
Data Science Application:
o Spotify uses collaborative filtering and natural language processing to analyze user preferences
and generate playlists like "Discover Weekly."
o Targeted ads based on listening habits ensure effective ad placement for free-tier users.
Impact:
o Increased user satisfaction and retention.
o Enhanced ad revenue through targeted advertising.
7. McDonald’s: Predictive Analytics for Menu Customization
Challenge: Catering to regional tastes while managing global operations.
Data Science Application:
o McDonald’s uses customer feedback, sales data, and regional preferences to decide on menu
changes.
o Predictive analytics helps forecast demand for specific items during promotional periods.
Impact:
o Higher customer satisfaction through localized menu options.
o Better resource management during promotions.
8. Procter & Gamble (P&G): Optimizing Advertising Campaigns
Challenge: Improving the ROI of advertising campaigns.
Data Science Application:
o P&G uses sentiment analysis and customer segmentation to design targeted advertising.
o They analyze historical campaign data to predict the effectiveness of new campaigns.
Impact:
o Increased conversion rates for ads.
o Reduced marketing costs through precise targeting.
These examples illustrate how data science enhances marketing by enabling personalization, optimizing
resources, and predicting customer behavior, ultimately driving business growth.