Reg. No.:
P18PECS021
(Only the above code is to be shaded in the answer book)
Bharath Institute of Higher Education & Research, Chennai – 73
Ph.D. Department, May/June 2025
P18PECS021 – Data Science
Time: 3 Hrs                                     Maximum: 100 Marks

Part A (10 x 2 = 20)
Answer All Questions
1. What is Data Science?
2. Explain the concept of 'Technology' in the context of data science.
3. What are the different sources of data in data science?
4. Define API and its role in data collection.
5. Define variance and its significance in data analysis.
6. What is the Central Limit Theorem (CLT)?
7. What are the main types of data visualization?
8. Explain the role of "retinal variables" in data visualization.
9. List two applications of data science in the healthcare industry.
10. What is Bokeh and how is it used for data visualization in Python?
1. What is Data Science?
Answer: Data Science is an interdisciplinary field that uses scientific
methods, processes, algorithms, and systems to extract knowledge and
insights from structured and unstructured data.
Example: Netflix uses data science to analyze users' watch history
and recommend personalized movie suggestions.
2. Explain the concept of 'Technology' in the context of Data Science.
Answer: Technology in Data Science refers to the tools, software,
and frameworks used to collect, store, process, analyze, and visualize
data.
Example: Python and R are popular programming languages used
for data analysis, machine learning, and visualization in Data
Science.
3. What are the different sources of data in Data Science?
Answer: Data can come from multiple sources, including:
- Structured sources (databases like MySQL)
- Unstructured sources (social media posts, text files)
- Sensor data (IoT devices)
- Web data (scraped web pages)
Example: Google Analytics collects web data to help businesses
understand user behavior on their websites.
4. Define API and its role in data collection.
Answer: An API (Application Programming Interface) allows
different software applications to communicate and exchange data.
Example: The Twitter API enables developers to fetch tweets, analyze
trends, and perform sentiment analysis.
5. Define variance and its significance in data analysis.
Answer: Variance measures the average squared deviation of data points
from the mean (Var = Σ(xᵢ − μ)² / n), indicating how spread out the data is.
Example: If a company's monthly sales vary widely, the variance will
be high, indicating inconsistency in performance.
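A minimal Python sketch, using the standard-library statistics module and made-up monthly sales figures, showing the two common variance conventions:

```python
import statistics

monthly_sales = [120, 80, 150, 60, 190]  # hypothetical figures

mean = statistics.mean(monthly_sales)                 # 120
var_population = statistics.pvariance(monthly_sales)  # divides by n: 2200
var_sample = statistics.variance(monthly_sales)       # divides by n - 1: 2750

print(mean, var_population, var_sample)
```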
6. What is the Central Limit Theorem (CLT)?
Answer: CLT states that the distribution of the sample means will
approximate a normal distribution as the sample size increases,
regardless of the population distribution (provided the population
variance is finite).
Example: If a researcher collects multiple samples of student heights
from different schools, their average height distribution will tend
toward a normal curve.
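A small NumPy simulation (with arbitrary parameters) that illustrates the theorem using a deliberately skewed population:

```python
import numpy as np

rng = np.random.default_rng(42)

# Draw from a heavily skewed (exponential) population: clearly not normal.
population = rng.exponential(scale=2.0, size=100_000)

# Take 10,000 samples of size 50 and record each sample's mean.
sample_means = np.array([
    rng.choice(population, size=50).mean() for _ in range(10_000)
])

# The sample means cluster around the population mean (about 2.0) in a
# near-normal bell shape, even though the population itself is skewed.
print(population.mean(), sample_means.mean(), sample_means.std())
```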
7. What are the main types of data visualization?
Answer: Common data visualization types include:
- Bar Charts (Comparing categories)
- Line Graphs (Trends over time)
- Scatter Plots (Relationships between variables)
- Heatmaps (Patterns in large datasets)
Example: A company might use a line graph to track its monthly
revenue growth.
8. Explain the role of "retinal variables" in data visualization.
Answer: Retinal variables, such as size, shape, color, and orientation,
help in encoding data visually to improve understanding.
Example: Heatmaps use color intensity to show variations in values,
making patterns more recognizable.
9. List two applications of data science in the healthcare industry.
Answer:
- Disease Prediction: AI models analyze patient symptoms and
medical history to predict illnesses early.
- Medical Imaging Analysis: Machine learning aids in detecting
anomalies in X-rays and MRIs.
Example: IBM Watson helps doctors diagnose diseases by analyzing
vast amounts of medical data.
10. What is Bokeh and how is it used for data visualization in Python?
Answer: Bokeh is a Python library that creates interactive and
visually appealing visualizations for web applications.
Example: A data analyst can use Bokeh to build an interactive
dashboard displaying real-time sales trends.
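A minimal Bokeh sketch with hypothetical sales figures; show() renders an interactive plot (with a pan/zoom toolbar) in the browser:

```python
from bokeh.plotting import figure, show

# Hypothetical monthly sales figures.
months = [1, 2, 3, 4, 5, 6]
sales = [120, 135, 150, 145, 170, 190]

p = figure(title="Monthly Sales", x_axis_label="Month",
           y_axis_label="Sales")
p.line(months, sales, line_width=2)

show(p)  # opens an interactive HTML plot in the browser
```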
Part B (5 x 6 = 30)
Answer either (a) or (b) from each question
1. (a) Discuss the role of Data Science in modern industries and how it
drives decision-making processes. (or)
(b) Explain the importance of the Data Science Process and its stages
in transforming raw data into actionable insights.
2. (a) Analyze the challenges involved in collecting data from multiple
sources and propose strategies for managing and integrating these
data sources. (or)
(b) Discuss the significance of APIs in modern data collection. How
do they facilitate data exchange across systems?
3. (a) Examine the concept of central tendency in statistics. How do
measures like mean, median, and mode contribute to understanding
data? (or)
(b) Discuss the Central Limit Theorem (CLT) and its significance in
statistical analysis and hypothesis testing.
4. (a) Describe the different types of data visualizations and the types
of data each is best suited for. (or)
(b) Evaluate the importance of mapping variables to visual
encodings in data visualization and how it impacts the clarity of
insights.
5. (a) Discuss the various applications of data science in sectors like
healthcare, finance, and e-commerce, and how they contribute to
solving industry-specific challenges. (or)
(b) Analyze the role of Bokeh in Python for creating interactive
visualizations. How does it differ from other visualization libraries
like Matplotlib and Seaborn?
1. (a) Role of Data Science in Modern Industries & Decision-
Making
Data Science plays a crucial role in modern industries by
analyzing large volumes of data to uncover patterns, trends, and
insights that drive business decisions. Companies across various
sectors use Data Science to optimize operations, improve
customer experience, and increase profitability.
Key Roles in Different Industries:
- Healthcare: Predict disease outbreaks, analyze patient
records for personalized treatments.
- Finance: Detect fraudulent transactions, assess credit risks.
- Retail & E-commerce: Enhance customer recommendations,
forecast demand trends.
- Manufacturing: Optimize supply chain efficiency, reduce
production downtime.
- Marketing: Perform sentiment analysis, target
advertisements effectively.
Example:
Netflix uses Data Science to analyze user preferences and
recommend personalized content, leading to increased
engagement and subscriptions.
1. (b) Importance of the Data Science Process
The Data Science process is a structured approach for
converting raw data into actionable insights. It consists of
multiple stages that ensure data-driven decision-making.
Stages of the Data Science Process:
1. Data Collection: Gathering relevant data from sources like
APIs, databases, and web scraping.
2. Data Cleaning: Removing inconsistencies, missing values,
and errors.
3. Exploratory Data Analysis (EDA): Understanding data
patterns and relationships using visualization.
4. Feature Engineering: Creating meaningful variables for
better predictive models.
5. Model Building & Training: Applying Machine Learning
algorithms to derive insights.
6. Evaluation & Deployment: Assessing model accuracy and
integrating insights into decision-making.
7. Monitoring & Improvement: Continuously refining models
based on new data.
Example:
A retail company may use Data Science to forecast sales by
analyzing past trends, seasonal effects, and customer buying
habits.
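A minimal end-to-end sketch of these stages, assuming a hypothetical sales.csv with month, ad_spend, and sales columns (pandas and scikit-learn):

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# 1. Collection: load a hypothetical sales history file.
df = pd.read_csv("sales.csv")  # assumed columns: month, ad_spend, sales

# 2. Cleaning: drop rows with missing values.
df = df.dropna()

# 3. EDA: quick summary statistics.
print(df.describe())

# 4-5. Feature selection and model training.
X = df[["month", "ad_spend"]]
y = df["sales"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
model = LinearRegression().fit(X_train, y_train)

# 6. Evaluation: R^2 on held-out data before deployment.
print("R^2:", model.score(X_test, y_test))
```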
2. (a) Challenges in Collecting Data from Multiple Sources &
Solutions
Data collection from various sources presents several challenges,
including inconsistencies, privacy concerns, and integration
difficulties.
Challenges:
- Data Format Variability: Different sources use different
formats, making integration complex.
- Data Accuracy & Quality Issues: Incorrect, incomplete, or
duplicated data can mislead analytics.
- Scalability Concerns: Handling large datasets requires
efficient infrastructure.
- Security & Privacy Regulations: Compliance with data
protection laws (e.g., GDPR).
Strategies to Overcome Challenges:
- Standardizing data formats using ETL (Extract, Transform,
Load) processes.
- Implementing data validation techniques to ensure quality.
- Using cloud storage solutions for scalability.
- Adopting encryption and authentication for secure data
handling.
Example:
A company aggregating data from social media, website
analytics, and customer surveys must harmonize different
formats before drawing insights.
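As an illustration of the ETL idea above, a small pandas sketch (with made-up web-analytics and survey records) that standardizes two source formats before combining them:

```python
import pandas as pd

# Two hypothetical sources with different field names and date formats.
web = pd.DataFrame({"visit_date": ["2025-01-05"], "visitors": [420]})
survey = pd.DataFrame({"Date": ["05/01/2025"], "Respondents": [35]})

# Transform: rename to one schema and parse dates consistently.
web = web.rename(columns={"visit_date": "date", "visitors": "count"})
web["date"] = pd.to_datetime(web["date"])
survey = survey.rename(columns={"Date": "date", "Respondents": "count"})
survey["date"] = pd.to_datetime(survey["date"], dayfirst=True)

# Load: combine into a single standardized table.
combined = pd.concat([web.assign(source="web"),
                      survey.assign(source="survey")])
print(combined)
```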
2. (b) Significance of APIs in Modern Data Collection
APIs (Application Programming Interfaces) facilitate seamless
data exchange between applications and systems, enabling real-
time data retrieval and automation.
How APIs Help in Data Collection:
- Automated Data Access: Allows direct extraction without
manual entry.
- Interoperability Across Systems: Enables different platforms
to communicate.
- Scalability & Efficiency: Handles high-volume data requests
dynamically.
- Real-Time Insights: Provides up-to-date information for
decision-making.
Example:
The Twitter API allows businesses to fetch live tweets, analyze
trends, and monitor customer sentiments to refine marketing
strategies.
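A hedged sketch of API-based collection with Python's requests library; the endpoint URL and parameters below are purely hypothetical, and real APIs such as Twitter/X additionally require authentication tokens:

```python
import requests

# Hypothetical REST endpoint (not a real service).
url = "https://api.example.com/v1/posts"
params = {"query": "data science", "limit": 50}

response = requests.get(url, params=params, timeout=10)
response.raise_for_status()   # fail loudly on HTTP errors

posts = response.json()       # structured data, ready for analysis
print(len(posts))
```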
3. (a) Concept of Central Tendency in Statistics
Central tendency describes how data points are distributed
around a central value, helping summarize data with key
metrics: Mean, Median, and Mode.
Definitions:
- Mean (Average): Sum of all values divided by total count.
- Median: Middle value when data is sorted.
- Mode: Most frequently occurring value.
Example:
For exam scores of [60, 70, 80, 90, 100]:
- Mean = (60+70+80+90+100)/5 = 80
- Median = 80 (middle value)
- Mode = none here, since every score occurs exactly once; if 80
appeared more than once, it would be the mode.
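A quick check with Python's standard-library statistics module (the list is altered slightly so that a mode exists):

```python
import statistics

scores = [60, 70, 80, 80, 90, 100]  # made-up marks; 80 repeats

print(statistics.mean(scores))    # 80
print(statistics.median(scores))  # 80 (average of the two middle values)
print(statistics.mode(scores))    # 80 (the most frequent value)
```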
3. (b) Central Limit Theorem (CLT) in Statistics
The CLT states that the distribution of sample means
approximates a normal distribution as the sample size
increases, regardless of the original data's distribution.
Significance of CLT:
- Helps estimate population characteristics from sample data.
- Forms the foundation of hypothesis testing and confidence
intervals.
- Enables predictions in financial risk analysis, healthcare, and
business intelligence.
Example:
If multiple small samples of customer purchasing amounts are
taken, their average will eventually form a normal distribution.
4. (a) Types of Data Visualizations & Their Uses
Data visualization helps make complex data understandable
through graphical representations. Different types serve various
purposes:
- Bar Chart: Comparing categorical data (e.g., sales by
product category).
- Line Graph: Showing trends over time (e.g., stock prices).
- Scatter Plot: Illustrating relationships (e.g., correlation
between height and weight).
- Heatmap: Displaying patterns in a large dataset (e.g., website
user activity).
Example:
A heatmap visualizing user clicks on an e-commerce website
helps identify popular sections.
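A minimal Matplotlib sketch, using made-up revenue figures, showing two of these chart types side by side:

```python
import matplotlib.pyplot as plt

# Hypothetical revenue figures.
months = ["Jan", "Feb", "Mar", "Apr"]
revenue = [10, 12, 9, 15]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))

ax1.bar(months, revenue)               # bar chart: comparing categories
ax1.set_title("Revenue by Month")

ax2.plot(months, revenue, marker="o")  # line graph: trend over time
ax2.set_title("Revenue Trend")

plt.tight_layout()
plt.show()
```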
4. (b) Mapping Variables to Visual Encodings
Retinal variables (size, shape, color, orientation) help in visual
encoding, improving clarity and interpretation.
Impact of Proper Mapping:
- Enhances user comprehension.
- Avoids misleading interpretations.
- Highlights key insights effectively.
Example:
Using different colors for data points in a scatter plot to
distinguish between groups improves readability.
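A small Matplotlib sketch (hypothetical data) that uses two retinal variables at once, color for group membership and size for magnitude:

```python
import matplotlib.pyplot as plt

# Hypothetical data: color encodes group, marker size encodes weight.
x = [1, 2, 3, 4, 5, 6]
y = [2, 4, 1, 5, 3, 6]
group = ["A", "A", "B", "B", "A", "B"]
colors = ["tab:blue" if g == "A" else "tab:orange" for g in group]
sizes = [40, 80, 120, 60, 100, 150]

plt.scatter(x, y, c=colors, s=sizes)
plt.title("Color and size as retinal variables")
plt.show()
```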
5. (a) Applications of Data Science in Various Industries
Data Science has revolutionized numerous industries:
- Healthcare: Disease prediction, personalized treatment
recommendations.
- Finance: Fraud detection, stock market trend analysis.
- E-commerce: Customer preference analysis,
recommendation engines.
Example:
Amazon uses predictive analytics to suggest products based on
browsing and purchase history.
5. (b) Role of Bokeh in Python for Interactive Visualizations
Bokeh is a Python visualization library known for its interactive
and web-based visualizations.
Comparison with Other Libraries:
- Matplotlib: Suitable for static plots (e.g., reports).
- Seaborn: Enhances statistical visualizations (e.g., correlation
matrices).
- Bokeh: Best for dynamic dashboards and interactive
applications.
Example:
A data analyst can use Bokeh to create interactive stock market
graphs where users can zoom, hover, and filter data.
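A hedged sketch of Bokeh's interactivity, adding hover tooltips that a static Matplotlib figure cannot provide; the price data is made up:

```python
from bokeh.plotting import figure, show
from bokeh.models import HoverTool

# Hypothetical daily closing prices.
days = [1, 2, 3, 4, 5]
price = [101.2, 103.5, 102.8, 105.1, 104.3]

p = figure(title="Closing Price", x_axis_label="Day",
           y_axis_label="Price")
p.line(days, price, line_width=2)

# Interactivity that static output lacks: tooltips on hover.
p.add_tools(HoverTool(tooltips=[("day", "@x"), ("price", "@y")]))

show(p)
```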
Part C (5 x 10 = 50)
Answer Five questions out of Seven
1. Discuss the core concepts of Data Science and how they are applied
across various industries such as healthcare, finance, and retail.
2. Analyze the different sources of data used in Data Science and
explain how they contribute to the data collection process.
3. Explain the concept of central tendency and its importance in
statistical analysis. Discuss how measures like mean, median, and
mode are used to summarize a dataset.
4. Discuss the importance of data visualization in Data Science and
explain how it aids in decision-making.
5. Analyze the various applications of Data Science in business,
particularly in areas like marketing, customer segmentation, and
fraud detection.
6. Discuss the challenges of visualizing time-series data and the
techniques that can be used to overcome them. How can
visualization help in identifying trends and patterns in such data?
7. Explain the importance of structure learning techniques in graph
mining. How do constraint-based and score-based algorithms differ,
and when would you use each approach?
1. Core Concepts of Data Science and Their Applications Across
Industries
Data Science is a multidisciplinary field that involves extracting
meaningful insights from structured and unstructured data
using statistical, machine learning, and computational
techniques. Its core concepts include data collection, cleaning,
analysis, visualization, and interpretation.
Applications Across Industries:
- Healthcare: Data Science helps in predictive analytics,
personalized treatment, and medical imaging. For example, AI-
driven diagnostics assist radiologists in detecting diseases like
cancer from medical scans.
- Finance: Fraud detection, risk assessment, and algorithmic
trading are major applications. Banks utilize machine learning
models to detect anomalies in transaction patterns.
- Retail: Data Science enhances customer segmentation,
inventory management, and personalized marketing strategies.
Companies like Amazon use recommendation systems to provide
personalized product suggestions.
2. Sources of Data in Data Science and Their Contribution to
Data Collection
Data Science relies on various sources of data that contribute to
the data collection process:
- Structured Data: Organized into rows and columns (e.g.,
relational databases, spreadsheets).
- Unstructured Data: Includes images, text, videos, social
media posts, and logs.
- Real-Time Data: Streaming data from sensors, financial
transactions, or user interactions.
- Public Data: Open-source datasets such as government
statistics, research papers, or census information.
Each type of data contributes to a holistic analysis, enabling
deeper insights into trends, behaviors, and anomalies.
3. Central Tendency and Its Importance in Statistical Analysis
Central tendency measures summarize a dataset by identifying a
central value around which the data is distributed. The three key
measures include:
- Mean: The arithmetic average of a dataset, used in academic
grading and financial metrics.
- Median: The middle value, useful in skewed distributions
like income levels.
- Mode: The most frequently occurring value, applied in
categorical data analysis.
Understanding central tendency is crucial for decision-making,
as it provides a representative value that simplifies complex
datasets.
4. Importance of Data Visualization in Data Science and Its Role
in Decision-Making
Data visualization translates complex data into graphical
representations, making it easier to identify patterns, trends, and
outliers.
Advantages:
- Enhanced Understanding: Graphs, charts, and heatmaps
provide intuitive insights.
- Improved Decision-Making: Businesses use dashboards for
real-time monitoring of performance.
- Simplified Communication: Visual data is more accessible to
stakeholders with varying technical expertise.
For example, a retailer analyzing sales performance using bar
graphs can easily identify seasonal trends and adjust inventory
accordingly.
5. Applications of Data Science in Business: Marketing,
Customer Segmentation, and Fraud Detection
Businesses leverage Data Science to enhance operations and
profitability:
- Marketing: Predictive analytics optimizes advertisement
targeting and pricing strategies.
- Customer Segmentation: Clustering techniques group
customers based on preferences, leading to personalized
promotions.
- Fraud Detection: Banks implement anomaly detection
models to flag suspicious transactions, preventing financial
losses.
For example, Netflix uses Data Science to recommend content
based on viewing history, increasing user engagement.
6. Challenges of Visualizing Time-Series Data and Overcoming
Them
Time-series data, which tracks variables over time, poses unique
visualization challenges:
- Data Overload: Large datasets may clutter graphs.
- Seasonality and Trends: Identifying recurring patterns can
be complex.
- Noise in Data: External fluctuations obscure meaningful
insights.
Techniques to Overcome Challenges:
- Smoothing Techniques: Moving averages help reduce noise.
- Interactive Visualizations: Dynamic graphs allow zooming
into specific time frames.
- Feature Engineering: Extracting time-based features
improves pattern recognition.
Visualizing stock market trends using candlestick charts helps
traders make informed investment decisions.
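A short pandas sketch of the moving-average technique above, using a synthetic noisy daily series:

```python
import numpy as np
import pandas as pd

# A noisy hypothetical daily series: gentle upward trend plus noise.
rng = np.random.default_rng(0)
dates = pd.date_range("2025-01-01", periods=90, freq="D")
values = np.linspace(100, 120, 90) + rng.normal(0, 5, 90)
series = pd.Series(values, index=dates)

# A 7-day moving average smooths the noise so the trend stands out.
smoothed = series.rolling(window=7).mean()

print(series.head(8))
print(smoothed.head(8))  # first 6 values are NaN (incomplete window)
```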
7. Importance of Structure Learning in Graph Mining and
Comparison of Algorithms
Graph mining helps uncover relationships in interconnected
data, such as social networks and transportation systems.
Structure Learning in Graph Mining:
It involves identifying underlying patterns in graphs to optimize
decision-making.
Comparison of Algorithms:
- Constraint-Based Algorithms: Use conditional independence tests to
decide which edges can exist (e.g., the PC algorithm for learning
Bayesian networks).
- Score-Based Algorithms: Search over candidate structures to maximize
a scoring function (e.g., BIC or Maximum Likelihood Estimation).
Constraint-based approaches are preferable when domain
knowledge is available, while score-based methods work well for
complex data with unknown relationships.
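A toy, self-contained sketch of the score-based idea: comparing BIC-style scores of candidate parent sets for one variable on synthetic linear-Gaussian data. This is purely illustrative; real libraries search over full DAG structures:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
a = rng.normal(size=n)
c = rng.normal(size=n)
b = 2 * a + rng.normal(size=n)  # true structure: a -> b, c unrelated
data = {"a": a, "b": b, "c": c}

def bic(child, parents):
    """BIC of a linear-Gaussian model child ~ parents (up to constants)."""
    y = data[child]
    cols = [data[p] for p in parents] + [np.ones(n)]
    X = np.column_stack(cols)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    loglik = -0.5 * n * np.log(resid.var())
    return loglik - 0.5 * len(cols) * np.log(n)  # penalize parameter count

for parents in [[], ["a"], ["c"], ["a", "c"]]:
    print(parents, round(bic("b", parents), 1))
# The true parent set ["a"] scores best: adding the irrelevant "c"
# is penalized more than it improves the fit.
```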
******