
### Q1) Attempt any EIGHT of the following.

*a) What is big data?*


Big data refers to extremely large datasets that cannot
be easily managed, processed, or analyzed using
traditional data processing tools. These datasets are
often characterized by the three V's: Volume (large
amounts of data), Velocity (the speed at which data is
generated), and Variety (different types of data, such as
structured, semi-structured, and unstructured).
*b) What is data manipulation?*
Data manipulation involves the process of adjusting,
organizing, and transforming raw data to make it suitable
for analysis. This can include tasks like sorting, filtering,
merging datasets, or changing data types to ensure
accuracy and consistency.
*c) What is data science?*
Data science is an interdisciplinary field that combines
techniques from statistics, machine learning, data
mining, and computer science to extract meaningful
insights and knowledge from structured and unstructured
data. It involves processes such as data cleaning,
analysis, and predictive modeling.
*d) What is statistical inference?*
Statistical inference is the process of drawing conclusions
about a population based on a sample of data. This
involves using probability theory to estimate population
parameters and test hypotheses about data trends or
relationships.
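As a small illustration (the sample values below are made up for this example), a one-sample t-test in R estimates a population mean from a sample and tests a hypothesis about it:
R
# Hypothetical sample of measurements; test whether the population mean is 12
sample_data <- c(12.1, 11.8, 12.5, 12.0, 11.9, 12.3, 12.2)
t.test(sample_data, mu = 12)  # gives a 95% confidence interval and a p-value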
*e) Enlist the stages of data science?*

The stages of data science typically include:


1. *Data Collection*: Gathering raw data from various
sources.
2. *Data Cleaning*: Removing inconsistencies and
handling missing values.
3. *Exploratory Data Analysis (EDA)*: Identifying patterns,
trends, and outliers in the data.
4. *Modeling*: Applying algorithms and statistical models
to the data.
5. *Evaluation*: Assessing the performance of models
using metrics.
6. *Deployment*: Implementing models into production
environments.
7. *Monitoring and Maintenance*: Continuously
evaluating the model's performance.
*f) Define Machine Learning.*
Machine Learning (ML) is a branch of artificial intelligence
that enables systems to learn from data, improve from
experience, and make predictions or decisions without
explicit programming. It involves the use of algorithms to
identify patterns and make data-driven predictions.
*g) Define SVM?*
Support Vector Machine (SVM) is a supervised machine
learning algorithm used for classification and regression
tasks. It works by finding a hyperplane that best
separates data into different classes. SVM can be used for
both linear and nonlinear classification.
*h) What is the use of a histogram?*
A histogram is a graphical representation of the
distribution of numerical data. It helps to visualize the
frequency of data points within certain ranges, providing
insights into the underlying distribution, patterns, and
outliers in the dataset.
*i) What is data analysis?*
Data analysis is the process of inspecting, cleaning,
transforming, and modeling data to discover useful
information, draw conclusions, and support decision-
making. It includes methods like statistical analysis,
exploratory data analysis, and predictive modeling.
*j) What is the use of themes?*
In data visualization, themes refer to the visual style and
design elements used to enhance the presentation of
data. Themes help to standardize the color schemes, font
styles, and layout of charts or graphs, ensuring
consistency and clarity in presenting information.
### Q2) Attempt any FOUR of the following.
*a) Explain different types of data analytics.*
1. *Descriptive Analytics*: Focuses on summarizing
historical data to understand past behavior and trends.
2. *Diagnostic Analytics*: Investigates why something
happened by analyzing data to identify causes and
relationships.
3. *Predictive Analytics*: Uses historical data and
statistical models to forecast future events or trends.
4. *Prescriptive Analytics*: Suggests actions and
outcomes based on data analysis to optimize decision-
making.
*b) Give advantages and disadvantages of Machine
Learning.*
*Advantages*:
- *Automation*: Can automate repetitive tasks and
processes.
- *Accuracy*: Can improve accuracy over time with more
data.
- *Adaptability*: Capable of handling complex, nonlinear
relationships in data.
- *Predictive Power*: Can predict future outcomes based
on patterns in historical data.
*Disadvantages*:
- *Data Dependency*: Requires large amounts of high-
quality data for effective learning.
- *Overfitting*: Models may become too complex and
perform poorly on new data.
- *Interpretability*: Some machine learning models (e.g.,
deep learning) are difficult to interpret.
- *Resource Intensive*: Requires significant
computational resources for training models.
*c) Explain the process of data analysis.*
The data analysis process typically includes the
following steps:
1. *Data Collection*: Gathering relevant data from
different sources.
2. *Data Cleaning*: Handling missing data, removing
outliers, and correcting errors.
3. *Exploratory Data Analysis (EDA)*: Using statistics and
visualizations to explore and understand data.
4. *Modeling*: Applying statistical or machine learning
models to identify patterns and relationships.
5. *Interpretation*: Drawing insights from the analysis
and making data-driven decisions.
6. *Communication*: Presenting the results of the
analysis through reports, charts, or dashboards.
*d) Explain probability distribution modeling.*
Probability distribution modeling is the process of using
probability distributions to model and analyze uncertain
data. It helps in understanding the likelihood of different
outcomes. Common probability distributions include:
- *Normal Distribution*: Represents data that follows a
bell curve (e.g., heights, test scores).
- *Binomial Distribution*: Used for binary outcomes (e.g.,
success/failure).
- *Poisson Distribution*: Models the number of events
occurring in a fixed interval of time or space.
- *Exponential Distribution*: Models the time between
events in a Poisson process.
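As a quick sketch, base R provides density and probability functions for each of these distributions (the parameter values below are arbitrary):
R
dnorm(0, mean = 0, sd = 1)        # normal density at 0
dbinom(3, size = 10, prob = 0.5)  # P(exactly 3 successes in 10 trials)
dpois(2, lambda = 4)              # P(2 events) when the average rate is 4
pexp(1, rate = 2)                 # P(waiting time <= 1) in a Poisson process with rate 2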

*e) Explain applications of big data.*


Big data has numerous applications across various
industries:
- *Healthcare*: Analyzing patient data to improve
diagnoses and treatment plans.
- *Finance*: Detecting fraud, optimizing investments, and
risk management.
- *Retail*: Personalizing marketing strategies and
managing inventory.
- *Transportation*: Analyzing traffic patterns and
optimizing routes.
- *Social Media*: Sentiment analysis and user behavior
prediction.
- *Government*: Enhancing public services, policy-
making, and disaster management.
### Q3) Attempt any FOUR of the following.
*a) State advantages and disadvantages of SVM.*
*Advantages*:
1. *Effective in high-dimensional spaces*: SVM is
particularly effective when the number of features is
larger than the number of data points.
2. *Robust to overfitting*: Especially in high-dimensional
space, SVMs tend to avoid overfitting by focusing on the
margins between classes.
3. *Versatile*: Can be used for both classification and
regression tasks.
4. *Works well with non-linear data*: SVM can handle
non-linearly separable data through the kernel trick.
*Disadvantages*:
1. *Computationally expensive*: Training an SVM can be
computationally intensive, especially with large datasets.
2. *Difficult to interpret*: The resulting models are often
hard to interpret compared to decision trees.
3. *Sensitive to noise*: In the case of overlapping classes
or noisy data, SVM performance can degrade.
4. *Choice of kernel*: The performance of SVM depends
heavily on selecting the right kernel function.
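A minimal sketch of fitting an SVM in R using the e1071 package (assuming it is installed) on the built-in iris data, where the kernel argument is the choice discussed above:
R
library(e1071)
# Radial (RBF) kernel handles non-linearly separable classes
model <- svm(Species ~ ., data = iris, kernel = "radial", cost = 1)
pred <- predict(model, iris)
table(Predicted = pred, Actual = iris$Species)  # confusion matrix on the training data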
*b) Explain Data frame with example.*
A *data frame* is a two-dimensional, tabular data
structure in R that is used to store data. It can contain
different types of variables (numeric, character, logical) in
columns, with each column representing a variable and
each row representing an observation. Data frames are
widely used in data analysis.
*Example*:
R
# Creating a data frame in R
data <- data.frame(
  Name = c("John", "Alice", "Bob"),
  Age = c(25, 30, 22),
  Salary = c(50000, 60000, 45000)
)
print(data)
Output:
   Name Age Salary
1  John  25  50000
2 Alice  30  60000
3   Bob  22  45000

In the above example, the data frame has 3 columns (Name, Age, and Salary) and 3 rows of data.
*c) Explain types of regression models.*
There are several types of regression models used in data
analysis:
1. *Linear Regression*: Predicts a continuous target
variable based on the linear relationship between the
target and independent variables.
- Formula: \( y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \epsilon \)
2. *Multiple Linear Regression*: Extension of linear
regression, where multiple independent variables are
used to predict the target.
3. *Logistic Regression*: Used for binary classification
problems where the target variable is categorical. It
predicts the probability of the binary outcome.
- Formula: \( p = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1)}} \)
4. *Polynomial Regression*: A form of regression where
the relationship between the independent variable and
the dependent variable is modeled as an nth-degree
polynomial.
5. *Ridge and Lasso Regression*: Types of linear
regression that include regularization (penalty) to avoid
overfitting.
- *Ridge*: L2 regularization.
- *Lasso*: L1 regularization.
6. *Decision Tree Regression*: Uses a tree-like model of
decisions to predict continuous target values based on
feature values.
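A brief sketch of fitting some of these models in R on the built-in mtcars data (the variable choices are illustrative only):
R
linear   <- lm(mpg ~ wt, data = mtcars)                     # simple linear regression
multiple <- lm(mpg ~ wt + hp, data = mtcars)                # multiple linear regression
logistic <- glm(am ~ wt, data = mtcars, family = binomial)  # logistic regression (binary target)
poly2    <- lm(mpg ~ poly(wt, 2), data = mtcars)            # degree-2 polynomial regression
summary(linear)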
*d) What is histogram? Explain with example in R.*
A *histogram* is a graphical representation of the
distribution of numerical data, where the data is divided
into bins (intervals), and the frequency of data points
within each bin is plotted.
*Example in R*:
R
# Creating a simple histogram in R
data <- c(10, 12, 15, 16, 20, 22, 25, 28, 30, 35, 40)
hist(data, main="Histogram Example", xlab="Values",
     ylab="Frequency", col="blue", border="black")
This code will generate a histogram for the data vector,
displaying the distribution of the values.
*e) Explain functions included in "dplyr" package.*
The dplyr package in R is used for data manipulation.
Some key functions in dplyr include:
1. *select()*: Selects specific columns from a data frame.
- Example: select(data, column1, column2)
2. *filter()*: Filters rows based on conditions.
- Example: filter(data, Age > 25)
3. *arrange()*: Sorts the data based on one or more
columns.
- Example: arrange(data, Age)
4. *mutate()*: Creates new columns or modifies existing
columns in a data frame.
- Example: mutate(data, AgeInMonths = Age * 12)
5. *summarize()*: Summarizes data by calculating statistics
like mean, sum, etc.
- Example: summarize(data, meanAge = mean(Age))
6. *group_by()*: Groups the data by one or more
variables, useful for performing operations on subsets.
- Example: group_by(data, Gender)
7. *left_join()*: Joins two data frames by a common
column.
- Example: left_join(df1, df2, by = "ID")
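A short sketch combining several of these verbs in one pipeline, reusing the data frame from Q3 b) (the filter threshold and derived column are arbitrary choices):
R
library(dplyr)
data <- data.frame(
  Name = c("John", "Alice", "Bob"),
  Age = c(25, 30, 22),
  Salary = c(50000, 60000, 45000)
)
data %>%
  filter(Age > 22) %>%                  # keep rows where Age exceeds 22
  mutate(AgeInMonths = Age * 12) %>%    # add a derived column
  arrange(desc(Salary)) %>%             # sort by Salary, highest first
  summarize(meanSalary = mean(Salary))  # collapse to a single summary value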
### Q4) Attempt any FOUR of the following.
*a) Explain Naive Bayes with the help of example.*
*Naive Bayes* is a classification algorithm based on
Bayes' Theorem, assuming independence between the
features. It calculates the probability of each class
given the features and chooses the class with the highest
probability.
*Example*:
If we have a dataset of emails labeled as "spam" or "not
spam" and features like "contains 'win'", "contains 'free'",
etc., Naive Bayes would compute the probability of each
email being spam or not based on the frequency of these
features in the respective classes.
For instance, if we have:
- P(Spam) = 0.4, P(Not Spam) = 0.6
- P('win' | Spam) = 0.3, P('win' | Not Spam) = 0.1
Using Bayes' Theorem:
\[
P(Spam \mid 'win') = \frac{P('win' \mid Spam) \times P(Spam)}{P('win')}
\]
We compute the posterior probability for both classes and
predict the class with the higher probability.
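Plugging the numbers above into Bayes' theorem, P('win') = 0.3 × 0.4 + 0.1 × 0.6 = 0.18, so P(Spam | 'win') = 0.12 / 0.18 ≈ 0.67 and the email is classified as spam. A small R sketch of the same arithmetic:
R
p_spam <- 0.4; p_not_spam <- 0.6
p_win_given_spam <- 0.3; p_win_given_not <- 0.1
# Total probability of seeing 'win'
p_win <- p_win_given_spam * p_spam + p_win_given_not * p_not_spam  # 0.18
# Posterior probability that the email is spam
p_spam_given_win <- (p_win_given_spam * p_spam) / p_win            # about 0.667
cat("P(Spam | 'win') =", round(p_spam_given_win, 3))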
*b) What is data visualization? Explain with example in
R.*
*Data visualization* is the graphical representation of
data to help users understand trends, patterns, and
insights from data. It involves using charts, graphs, and
plots to present information in a visually appealing and
easy-to-interpret manner.
*Example in R*:
R
# Creating a simple bar chart in R
data <- c(10, 20, 30, 40, 50)
names(data) <- c("A", "B", "C", "D", "E")
barplot(data, main="Bar Chart Example", col="red",
        xlab="Categories", ylab="Values")
This code will generate a bar chart showing the values of
different categories (A, B, C, D, E).
*c) Write a R program to accept temperatures in
Fahrenheit (F) and print it in Celsius (C).*
R
# Program to convert Fahrenheit to Celsius
fahrenheit <- as.numeric(readline(prompt="Enter temperature in Fahrenheit: "))
celsius <- (fahrenheit - 32) * 5/9
cat(fahrenheit, "Fahrenheit is", celsius, "Celsius")
This code prompts the user to input a temperature in
Fahrenheit, converts it to Celsius, and displays the result.
*d) Accept three dimensions length (l), breadth (b) and
height (h) of a cuboid and print its volume.*
R
# Program to calculate the volume of a cuboid
l <- as.numeric(readline(prompt="Enter length: "))
b <- as.numeric(readline(prompt="Enter breadth: "))
h <- as.numeric(readline(prompt="Enter height: "))
volume <- l * b * h
cat("The volume of the cuboid is:", volume)
This code accepts the dimensions of the cuboid and calculates the volume using the formula \( V = l \times b \times h \).
*e) Write a R program to accept any year as input and
check whether the year is a leap year or not.*
R
# Program to check if a year is a leap year
year <- as.numeric(readline(prompt="Enter a year: "))
if ((year %% 4 == 0 && year %% 100 != 0) || (year %% 400 == 0)) {
  cat(year, "is a leap year.")
} else {
  cat(year, "is not a leap year.")
}
This program checks the conditions for a leap year
(divisible by 4, and not divisible by 100 unless divisible by
400).
### Q5) Write a short note on Any TWO of the following.
*a) Tools used in Big Data.*
1. *Hadoop*: A framework for distributed storage and
processing of large datasets using a cluster of computers.
It includes components like HDFS (Hadoop Distributed
File System) and MapReduce.
2. *Spark*: An in-memory computing engine for big data
processing, often used as an alternative to Hadoop for
faster data analysis.
3. *NoSQL Databases*: Examples include MongoDB, Cassandra, and
HBase, designed to handle unstructured or semi-structured data
in big data environments.
4. *Tableau*: A data visualization tool that helps to create
interactive and shareable dashboards from big data
sources.

*b) Advantages of Big Data.*


1. *Improved Decision Making*: With access to a large
volume of data, businesses can make more informed
decisions based on trends, insights, and predictive
analysis.
2. *Cost Efficiency*: Big data tools like Hadoop enable
processing of data in a distributed manner, lowering the
cost of storage and processing.
3. *Better Customer Insights*: By analyzing big data,
companies can gain insights into customer behavior,
improving targeting, personalization, and service
delivery.
4. *Innovation*: Big data can provide new opportunities
for innovation, including new business models and
solutions.
*c) Advantages and Disadvantages of EM algorithms.*
*Advantages*:
1. *Works with incomplete data*: EM (Expectation-
Maximization) is effective for problems with missing or
incomplete data.
2. *Versatile*: It can be applied to a variety of models,
including mixture models and hidden Markov models.
3. *Finds maximum likelihood estimates*: EM finds
the maximum likelihood estimates for the parameters of
statistical models.
*Disadvantages*:
1. *Local maxima*: EM may converge to local maxima
instead of the global maximum, which can lead to
suboptimal solutions.
2. *Computationally expensive*: It requires multiple
iterations, which can be resource-intensive.
3. *Sensitive to initial values*: The performance of EM
can be heavily influenced by the initial parameter
guesses.
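A minimal sketch of EM for a two-component Gaussian mixture in base R (the simulated data and starting values are arbitrary choices; the loop also illustrates why results depend on initialization):
R
set.seed(1)
x <- c(rnorm(100, mean = 0, sd = 1), rnorm(100, mean = 5, sd = 1))  # simulated mixture data

# Initial parameter guesses (EM is sensitive to these)
p1 <- 0.5; mu <- c(-1, 4); sdev <- c(1, 1)

for (iter in 1:100) {
  # E-step: responsibility of component 1 for each observation
  d1 <- p1 * dnorm(x, mu[1], sdev[1])
  d2 <- (1 - p1) * dnorm(x, mu[2], sdev[2])
  r1 <- d1 / (d1 + d2)

  # M-step: update mixing weight, means, and standard deviations
  p1      <- mean(r1)
  mu[1]   <- sum(r1 * x) / sum(r1)
  mu[2]   <- sum((1 - r1) * x) / sum(1 - r1)
  sdev[1] <- sqrt(sum(r1 * (x - mu[1])^2) / sum(r1))
  sdev[2] <- sqrt(sum((1 - r1) * (x - mu[2])^2) / sum(1 - r1))
}
cat("Mixing weight:", p1, " Means:", mu, " SDs:", sdev, "\n")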
