
### Q1) Attempt any EIGHT of the following.

*a) What is big data?*


Big data refers to extremely large datasets that cannot
be easily managed, processed, or analyzed using
traditional data processing tools. These datasets are
often characterized by the three V's: Volume (large
amounts of data), Velocity (the speed at which data is
generated), and Variety (different types of data, such as
structured, semi-structured, and unstructured).
*b) What is data manipulation?*
Data manipulation involves the process of adjusting,
organizing, and transforming raw data to make it suitable
for analysis. This can include tasks like sorting, filtering,
merging datasets, or changing data types to ensure
accuracy and consistency.
*c) What is data science?*
Data science is an interdisciplinary field that combines
techniques from statistics, machine learning, data
mining, and computer science to extract meaningful
insights and knowledge from structured and unstructured
data. It involves processes such as data cleaning,
analysis, and predictive modeling.
*d) What is statistical inference?*
Statistical inference is the process of drawing conclusions
about a population based on a sample of data. This
involves using probability theory to estimate population
parameters and test hypotheses about data trends or
relationships.
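As a small illustration (the sample values below are made up for this example), a one-sample t-test in R estimates a population mean from a sample and tests a hypothesis about it:
R
# Hypothetical sample of measurements; test whether the population mean is 12
sample_data <- c(12.1, 11.8, 12.5, 12.0, 11.9, 12.3, 12.2)
t.test(sample_data, mu = 12)  # gives a 95% confidence interval and a p-value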
*e) Enlist the stages of data science?*

The stages of data science typically include:


1. *Data Collection*: Gathering raw data from various
sources.
2. *Data Cleaning*: Removing inconsistencies and
handling missing values.
3. *Exploratory Data Analysis (EDA)*: Identifying patterns,
trends, and outliers in the data.
4. *Modeling*: Applying algorithms and statistical models
to the data.
5. *Evaluation*: Assessing the performance of models
using metrics.
6. *Deployment*: Implementing models into production
environments.
7. *Monitoring and Maintenance*: Continuously
evaluating the model's performance.
*f) Define Machine Learning.*
Machine Learning (ML) is a branch of artificial intelligence
that enables systems to learn from data, improve from
experience, and make predictions or decisions without
explicit programming. It involves the use of algorithms to
identify patterns and make data-driven predictions.
*g) Define SVM?*
Support Vector Machine (SVM) is a supervised machine
learning algorithm used for classification and regression
tasks. It works by finding a hyperplane that best
separates data into different classes. SVM can be used for
both linear and nonlinear classification.
*h) What is the use of a histogram?*
A histogram is a graphical representation of the
distribution of numerical data. It helps to visualize the
frequency of data points within certain ranges, providing
insights into the underlying distribution, patterns, and
outliers in the dataset.
*i) What is data analysis?*
Data analysis is the process of inspecting, cleaning,
transforming, and modeling data to discover useful
information, draw conclusions, and support decision-
making. It includes methods like statistical analysis,
exploratory data analysis, and predictive modeling.
*j) What is the use of themes?*
In data visualization, themes refer to the visual style and
design elements used to enhance the presentation of
data. Themes help to standardize the color schemes, font
styles, and layout of charts or graphs, ensuring
consistency and clarity in presenting information.
### Q2) Attempt any FOUR of the following.
*a) Explain different types of data analytics.*
1. *Descriptive Analytics*: Focuses on summarizing
historical data to understand past behavior and trends.
2. *Diagnostic Analytics*: Investigates why something
happened by analyzing data to identify causes and
relationships.
3. *Predictive Analytics*: Uses historical data and
statistical models to forecast future events or trends.
4. *Prescriptive Analytics*: Suggests actions and
outcomes based on data analysis to optimize decision-
making.
*b) Give advantages and disadvantages of Machine
Learning.*
*Advantages*:
- *Automation*: Can automate repetitive tasks and
processes.
- *Accuracy*: Can improve accuracy over time with more
data.
- *Adaptability*: Capable of handling complex, nonlinear
relationships in data.
- *Predictive Power*: Can predict future outcomes based
on patterns in historical data.
*Disadvantages*:
- *Data Dependency*: Requires large amounts of high-
quality data for effective learning.
- *Overfitting*: Models may become too complex and
perform poorly on new data.
- *Interpretability*: Some machine learning models (e.g.,
deep learning) are difficult to interpret.
- *Resource Intensive*: Requires significant
computational resources for training models.
*c) Explain the process of data analysis.*
The data analysis process typically includes the
following steps:
1. *Data Collection*: Gathering relevant data from
different sources.
2. *Data Cleaning*: Handling missing data, removing
outliers, and correcting errors.
3. *Exploratory Data Analysis (EDA)*: Using statistics and
visualizations to explore and understand data.
4. *Modeling*: Applying statistical or machine learning
models to identify patterns and relationships.
5. *Interpretation*: Drawing insights from the analysis
and making data-driven decisions.
6. *Communication*: Presenting the results of the
analysis through reports, charts, or dashboards.
*d) Explain probability distribution modeling.*
Probability distribution modeling is the process of using
probability distributions to model and analyze uncertain
data. It helps in understanding the likelihood of different
outcomes. Common probability distributions include:
- *Normal Distribution*: Represents data that follows a
bell curve (e.g., heights, test scores).
- *Binomial Distribution*: Used for binary outcomes (e.g.,
success/failure).
- *Poisson Distribution*: Models the number of events
occurring in a fixed interval of time or space.
- *Exponential Distribution*: Models the time between
events in a Poisson process.
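As a quick sketch, base R provides density and probability functions for each of these distributions (the parameter values below are arbitrary):
R
dnorm(0, mean = 0, sd = 1)        # normal density at 0
dbinom(3, size = 10, prob = 0.5)  # P(exactly 3 successes in 10 trials)
dpois(2, lambda = 4)              # P(2 events) when the average rate is 4
pexp(1, rate = 2)                 # P(waiting time <= 1) in a Poisson process with rate 2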

*e) Explain applications of big data.*


Big data has numerous applications across various
industries:
- *Healthcare*: Analyzing patient data to improve
diagnoses and treatment plans.
- *Finance*: Detecting fraud, optimizing investments, and
risk management.
- *Retail*: Personalizing marketing strategies and
managing inventory.
- *Transportation*: Analyzing traffic patterns and
optimizing routes.
- *Social Media*: Sentiment analysis and user behavior
prediction.
- *Government*: Enhancing public services, policy-
making, and disaster management.
### Q3) Attempt any FOUR of the following.
*a) State advantages and disadvantages of SVM.*
*Advantages*:
1. *Effective in high-dimensional spaces*: SVM is
particularly effective when the number of features is
larger than the number of data points.
2. *Robust to overfitting*: Especially in high-dimensional
space, SVMs tend to avoid overfitting by focusing on the
margins between classes.
3. *Versatile*: Can be used for both classification and
regression tasks.
4. *Works well with non-linear data*: SVM can handle
non-linearly separable data through the kernel trick.
*Disadvantages*:
1. *Computationally expensive*: Training an SVM can be
computationally intensive, especially with large datasets.
2. *Difficult to interpret*: The resulting models are often
hard to interpret compared to decision trees.
3. *Sensitive to noise*: In the case of overlapping classes
or noisy data, SVM performance can degrade.
4. *Choice of kernel*: The performance of SVM depends
heavily on selecting the right kernel function.
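A minimal sketch of fitting an SVM in R using the e1071 package (assuming it is installed) on the built-in iris data, where the kernel argument is the choice discussed above:
R
library(e1071)
# Radial (RBF) kernel handles non-linearly separable classes
model <- svm(Species ~ ., data = iris, kernel = "radial", cost = 1)
pred <- predict(model, iris)
table(Predicted = pred, Actual = iris$Species)  # confusion matrix on the training data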
*b) Explain Data frame with example.*
A *data frame* is a two-dimensional, tabular data
structure in R that is used to store data. It can contain
different types of variables (numeric, character, logical) in
columns, with each column representing a variable and
each row representing an observation. Data frames are
widely used in data analysis.
*Example*:
R
# Creating a data frame in R
data <- data.frame(
  Name = c("John", "Alice", "Bob"),
  Age = c(25, 30, 22),
  Salary = c(50000, 60000, 45000)
)
print(data)
Output:
   Name Age Salary
1  John  25  50000
2 Alice  30  60000
3   Bob  22  45000

In the above example, the data frame has 3 columns (Name, Age, and Salary) and 3 rows of data.
*c) Explain types of regression models.*
There are several types of regression models used in data
analysis:
1. *Linear Regression*: Predicts a continuous target
variable based on the linear relationship between the
target and independent variables.
- Formula: \( y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \epsilon \)
2. *Multiple Linear Regression*: Extension of linear
regression, where multiple independent variables are
used to predict the target.
3. *Logistic Regression*: Used for binary classification
problems where the target variable is categorical. It
predicts the probability of the binary outcome.
- Formula: \( p = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1)}} \)
4. *Polynomial Regression*: A form of regression where
the relationship between the independent variable and
the dependent variable is modeled as an nth-degree
polynomial.
5. *Ridge and Lasso Regression*: Types of linear
regression that include regularization (penalty) to avoid
overfitting.
- *Ridge*: L2 regularization.
- *Lasso*: L1 regularization.
6. *Decision Tree Regression*: Uses a tree-like model of
decisions to predict continuous target values based on
feature values.
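A brief sketch of fitting some of these models in R on the built-in mtcars data (the variable choices are illustrative only):
R
linear   <- lm(mpg ~ wt, data = mtcars)                     # simple linear regression
multiple <- lm(mpg ~ wt + hp, data = mtcars)                # multiple linear regression
logistic <- glm(am ~ wt, data = mtcars, family = binomial)  # logistic regression (binary target)
poly2    <- lm(mpg ~ poly(wt, 2), data = mtcars)            # degree-2 polynomial regression
summary(linear)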
*d) What is histogram? Explain with example in R.*
A *histogram* is a graphical representation of the
distribution of numerical data, where the data is divided
into bins (intervals), and the frequency of data points
within each bin is plotted.
*Example in R*:
R
# Creating a simple histogram in R
data <- c(10, 12, 15, 16, 20, 22, 25, 28, 30, 35, 40)
hist(data, main="Histogram Example", xlab="Values",
     ylab="Frequency", col="blue", border="black")
This code will generate a histogram for the data vector,
displaying the distribution of the values.
*e) Explain functions included in "dplyr" package.*
The dplyr package in R is used for data manipulation.
Some key functions in dplyr include:
1. *select()*: Selects specific columns from a data frame.
- Example: select(data, column1, column2)
2. *filter()*: Filters rows based on conditions.
- Example: filter(data, Age > 25)
3. *arrange()*: Sorts the data based on one or more
columns.
- Example: arrange(data, Age)
4. *mutate()*: Creates new columns or modifies existing
columns in a data frame.
- Example: mutate(data, AgeInMonths = Age * 12)
5. *summarize()*: Summarizes data by calculating statistics
like mean, sum, etc.
- Example: summarize(data, meanAge = mean(Age))
6. *group_by()*: Groups the data by one or more
variables, useful for performing operations on subsets.
- Example: group_by(data, Gender)
7. *left_join()*: Joins two data frames by a common
column.
- Example: left_join(df1, df2, by = "ID")
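A short sketch combining several of these verbs in one pipeline, reusing the data frame from Q3 b) (the filter threshold and derived column are arbitrary choices):
R
library(dplyr)
data <- data.frame(
  Name = c("John", "Alice", "Bob"),
  Age = c(25, 30, 22),
  Salary = c(50000, 60000, 45000)
)
data %>%
  filter(Age > 22) %>%                  # keep rows where Age exceeds 22
  mutate(AgeInMonths = Age * 12) %>%    # add a derived column
  arrange(desc(Salary)) %>%             # sort by Salary, highest first
  summarize(meanSalary = mean(Salary))  # collapse to a single summary value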
### Q4) Attempt any FOUR of the following.
*a) Explain Naive Bayes with the help of example.*
*Naive Bayes* is a classification algorithm based on
Bayes' Theorem, assuming independence between the
features. It calculates the probability of each class
given the features and chooses the class with the highest
probability.
*Example*:
If we have a dataset of emails labeled as "spam" or "not
spam" and features like "contains 'win'", "contains 'free'",
etc., Naive Bayes would compute the probability of each
email being spam or not based on the frequency of these
features in the respective classes.
For instance, if we have:
- P(Spam) = 0.4, P(Not Spam) = 0.6
- P('win' | Spam) = 0.3, P('win' | Not Spam) = 0.1
Using Bayes' Theorem:
\[
P(Spam \mid 'win') = \frac{P('win' \mid Spam) \times P(Spam)}{P('win')}
\]
We compute the posterior probability for both classes and
predict the class with the higher probability.
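Plugging the numbers above into Bayes' theorem, P('win') = 0.3 × 0.4 + 0.1 × 0.6 = 0.18, so P(Spam | 'win') = 0.12 / 0.18 ≈ 0.67 and the email is classified as spam. A small R sketch of the same arithmetic:
R
p_spam <- 0.4; p_not_spam <- 0.6
p_win_given_spam <- 0.3; p_win_given_not <- 0.1
# Total probability of seeing 'win'
p_win <- p_win_given_spam * p_spam + p_win_given_not * p_not_spam  # 0.18
# Posterior probability that the email is spam
p_spam_given_win <- (p_win_given_spam * p_spam) / p_win            # about 0.667
cat("P(Spam | 'win') =", round(p_spam_given_win, 3))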
*b) What is data visualization? Explain with example in
R.*
*Data visualization* is the graphical representation of
data to help users understand trends, patterns, and
insights from data. It involves using charts, graphs, and
plots to present information in a visually appealing and
easy-to-interpret manner.
*Example in R*:
R
# Creating a simple bar chart in R
data <- c(10, 20, 30, 40, 50)
names(data) <- c("A", "B", "C", "D", "E")
barplot(data, main="Bar Chart Example", col="red",
        xlab="Categories", ylab="Values")
This code will generate a bar chart showing the values of
different categories (A, B, C, D, E).
*c) Write a R program to accept temperatures in
Fahrenheit (F) and print it in Celsius (C).*
R
# Program to convert Fahrenheit to Celsius
fahrenheit <- as.numeric(readline(prompt="Enter temperature in Fahrenheit: "))
celsius <- (fahrenheit - 32) * 5/9
cat(fahrenheit, "Fahrenheit is", celsius, "Celsius")
This code prompts the user to input a temperature in
Fahrenheit, converts it to Celsius, and displays the result.
*d) Accept three dimensions length (l), breadth (b) and
height (h) of a cuboid and print its volume.*
R
# Program to calculate the volume of a cuboid
l <- as.numeric(readline(prompt="Enter length: "))
b <- as.numeric(readline(prompt="Enter breadth: "))
h <- as.numeric(readline(prompt="Enter height: "))
volume <- l * b * h
cat("The volume of the cuboid is:", volume)
This code accepts the dimensions of the cuboid and calculates the volume using the formula \( V = l \times b \times h \).
*e) Write a R program to accept any year as input and
check whether the year is a leap year or not.*
R
# Program to check if a year is a leap year
year <- as.numeric(readline(prompt="Enter a year: "))
if ((year %% 4 == 0 && year %% 100 != 0) || (year %% 400 == 0)) {
  cat(year, "is a leap year.")
} else {
  cat(year, "is not a leap year.")
}
This program checks the conditions for a leap year
(divisible by 4, and not divisible by 100 unless divisible by
400).
### Q5) Write a short note on Any TWO of the following.
*a) Tools used in Big Data.*
1. *Hadoop*: A framework for distributed storage and
processing of large datasets using a cluster of computers.
It includes components like HDFS (Hadoop Distributed
File System) and MapReduce.
2. *Spark*: An in-memory computing engine for big data
processing, often used as an alternative to Hadoop for
faster data analysis.
3. *NoSQL Databases*: Examples include MongoDB, Cassandra, and
HBase, designed to handle unstructured or semi-structured data
in big data environments.
4. *Tableau*: A data visualization tool that helps to create
interactive and shareable dashboards from big data
sources.

*b) Advantages of Big Data.*


1. *Improved Decision Making*: With access to a large
volume of data, businesses can make more informed
decisions based on trends, insights, and predictive
analysis.
2. *Cost Efficiency*: Big data tools like Hadoop enable
processing of data in a distributed manner, lowering the
cost of storage and processing.
3. *Better Customer Insights*: By analyzing big data,
companies can gain insights into customer behavior,
improving targeting, personalization, and service
delivery.
4. *Innovation*: Big data can provide new opportunities
for innovation, including new business models and
solutions.
*c) Advantages and Disadvantages of EM algorithms.*
*Advantages*:
1. *Works with incomplete data*: EM (Expectation-
Maximization) is effective for problems with missing or
incomplete data.
2. *Versatile*: It can be applied to a variety of models,
including mixture models and hidden Markov models.
3. *Finds maximum likelihood estimates*: EM finds
the maximum likelihood estimates for the parameters of
statistical models.
*Disadvantages*:
1. *Local maxima*: EM may converge to local maxima
instead of the global maximum, which can lead to
suboptimal solutions.
2. *Computationally expensive*: It requires multiple
iterations, which can be resource-intensive.
3. *Sensitive to initial values*: The performance of EM
can be heavily influenced by the initial parameter
guesses.
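A minimal sketch of EM for a two-component Gaussian mixture in base R (the simulated data and starting values are arbitrary choices; the loop also illustrates why results depend on initialization):
R
set.seed(1)
x <- c(rnorm(100, mean = 0, sd = 1), rnorm(100, mean = 5, sd = 1))  # simulated mixture data

# Initial parameter guesses (EM is sensitive to these)
p1 <- 0.5; mu <- c(-1, 4); sdev <- c(1, 1)

for (iter in 1:100) {
  # E-step: responsibility of component 1 for each observation
  d1 <- p1 * dnorm(x, mu[1], sdev[1])
  d2 <- (1 - p1) * dnorm(x, mu[2], sdev[2])
  r1 <- d1 / (d1 + d2)

  # M-step: update mixing weight, means, and standard deviations
  p1      <- mean(r1)
  mu[1]   <- sum(r1 * x) / sum(r1)
  mu[2]   <- sum((1 - r1) * x) / sum(1 - r1)
  sdev[1] <- sqrt(sum(r1 * (x - mu[1])^2) / sum(r1))
  sdev[2] <- sqrt(sum((1 - r1) * (x - mu[2])^2) / sum(1 - r1))
}
cat("Mixing weight:", p1, " Means:", mu, " SDs:", sdev, "\n")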
