BDA Answers

The document discusses various aspects of data analytics, including the role of IoT devices in capturing real-time data, the significance of web data in analytical architecture, and the stages of the Data Analytics Lifecycle. It also covers topics such as data types, hypothesis testing, and the importance of cloud storage in managing big data. Additionally, it highlights the challenges and strategies in data preparation and the impact of effective communication of results on business decisions.

Short Answer Questions (2 Marks)

1. Role of IoT devices in capturing real-time data for analytics.

IoT devices play a crucial role in capturing real-time data for analytics by acting as
sensors and data collection points. They gather data from their environment or
connected systems and transmit it for processing and analysis. This real-time data
enables immediate insights and decision-making.

2. Web data contributes to the current analytical architecture.

Web data is a significant contributor to the current analytical architecture. It provides valuable information about user behavior, preferences, and trends, which can be analyzed to improve business strategies and decision-making.

3. Key stages involved in the Data Analytics Lifecycle.

The key stages involved in the Data Analytics Lifecycle include data discovery, data
preparation, model planning, model building, operationalize, and communicate
results.

4. Graphical User Interface (GUI) in R

R's Graphical User Interface (GUI) simplifies the data analysis process, making it
more accessible for both beginners and advanced users. It provides an easier way to
interact with the software and perform analyses.

5. Comparison of exploration and data presentation.

Data exploration involves investigating data to understand its characteristics and identify patterns, while data presentation focuses on communicating the findings of the analysis in a clear and concise manner.

6. Extend the purpose of ranking tests in hypothesis evaluation.

Ranking tests in hypothesis evaluation help compare multiple variables and prioritize
factors that significantly affect the outcome.

7. TF-IDF in text analysis.

TF-IDF (Term Frequency-Inverse Document Frequency) is a method used in text analysis to determine the importance of a term within a collection of documents.

8. Example application of text analytics in real-world scenarios.

Text analytics can be applied in real-world scenarios such as analyzing customer reviews to detect common issues or sentiments.

9. Difference between structured, semi-structured, and unstructured data with examples.

o Structured data is organized in a specific format (e.g., tables in a database).
o Semi-structured data has some organization but lacks a rigid format (e.g., XML, JSON).
o Unstructured data has no predefined format (e.g., text documents, audio, video).
10. Concept of the emerging big data ecosystem.

The emerging big data ecosystem refers to the evolving landscape of technologies,
tools, and practices for managing and analyzing large and complex datasets.

11. Comparison of data import and data export in R.

Data import in R involves reading data from external sources into R for analysis,
while data export is the process of saving data from R to external files or systems.

12. 'Communicate Results' phase in the data analytics lifecycle

The "Communicate Results" phase in the data analytics lifecycle is where the findings
of the analysis are presented to stakeholders in a clear and understandable way.

13. Purpose of visualizing a single variable in data analysis.

Visualizing a single variable in data analysis helps to understand its distribution, identify outliers, and detect patterns or anomalies.

14. ANOVA test in statistical evaluation

ANOVA (Analysis of Variance) is a statistical test used to compare means across different groups in a dataset.

15. "Text Analysis".

Text analysis is the process of extracting meaningful information from unstructured text data.

16. "Categorizing documents by topics" mean in text analytics

Categorizing documents by topics in text analytics refers to the technique of organizing a collection of documents into groups based on their content.

17. Video and audio data from embedded systems contribute to big data generation.

Video and audio data from embedded systems are significant contributors to big data generation, as they continuously stream large, high-volume recordings that must be stored and processed at scale.

18. Cloud storage supports big data analytics.

Cloud storage provides scalable and accessible storage solutions that are essential for
big data analytics.
19. "Big Data" in the context of analytics

In the context of analytics, "Big Data" refers to extremely large and complex datasets
that require specialized tools and techniques to analyze and extract meaningful
insights.

20. Any two attributes of data types in R.

Two attributes of data types in R are their class (e.g., numeric, character) and their
structure (e.g., vector, matrix, data frame).

21. Importance of visualizing data before performing analysis.

Visualizing data before performing analysis is important for gaining initial insights,
identifying potential issues, and guiding the subsequent analysis.

22. Significance of the difference of means test in data analysis

The difference of means test is significant in data analysis as it allows for comparing
the average values of two groups to determine if there is a statistically significant
difference between them.

23. Any two benefits of using Hadoop for data analytics.

Two benefits of using Hadoop for data analytics are its ability to handle large datasets
and its distributed storage and processing capabilities.

24. Any two steps involved in the text analysis process

Two steps involved in the text analysis process are text collection and pre-processing.

25. Transactional data sources play a role in the big data ecosystem

Transactional data sources are important components of the big data ecosystem,
providing data on business transactions and operations.

26. Analyst's perspective when choosing data repositories for big data projects.

From an analyst's perspective, when choosing data repositories for big data projects,
factors like scalability, accessibility, and the ability to handle heterogeneous data are
important.

27. Importance of Big Data Analytics in industry verticals like healthcare or retail.

Big Data Analytics is important in industry verticals like healthcare and retail for
applications such as improving patient care, personalizing customer experiences, and
optimizing operations.

28. Rephrase 'State of the Practice' in analytics?


"State of the Practice" in analytics refers to the current methods, tools, and techniques
that are being used in the field.

29. Role of hypothesis testing in statistical evaluation.

The role of hypothesis testing in statistical evaluation is to validate assumptions about a dataset before proceeding with advanced analysis.

30. Multiple variables help in understanding data relationships.

Analyzing multiple variables simultaneously helps in understanding complex relationships within data.

31. MapReduce processes data in Hadoop.

MapReduce is a programming model used to process large datasets in Hadoop.

32. Example application of text analytics in real-world scenarios.

Text analytics can be used to analyze customer reviews to detect common issues or
sentiments.

Paragraph Answer Questions (4 Marks)

1. Different kinds of data structures used in Big Data systems

Big data systems utilize various data structures to efficiently store and manage large
volumes of data. These include structured data structures like relational databases,
semi-structured data structures like XML and JSON, and unstructured data formats
such as text documents, images, and videos. Each type has its own characteristics and
is suitable for different kinds of data and analysis.

2. Descriptive statistics in R in the discovery phase of the Data Analytics Lifecycle.

In the discovery phase of the Data Analytics Lifecycle, descriptive statistics in R are
used to summarize and understand the main features of a dataset. This involves
calculating measures like mean, median, mode, standard deviation, and variance to
gain insights into the data's distribution, central tendency, and variability.

3. ANOVA (Analysis of Variance) to compare means across different groups in a dataset, using an example scenario.

ANOVA is used to compare the means of two or more groups. For example, in a
study comparing the effectiveness of three different teaching methods, ANOVA can
determine if there is a statistically significant difference in the average test scores of
students taught by each method.
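A minimal sketch in base R, assuming a small invented data frame of test scores for three teaching methods (all values are hypothetical):

    # Hypothetical test scores for students taught with methods A, B and C
    scores <- data.frame(
      score  = c(72, 75, 78, 80, 85, 88, 65, 70, 68),
      method = factor(rep(c("A", "B", "C"), each = 3))
    )

    # One-way ANOVA: do the mean scores differ across methods?
    fit <- aov(score ~ method, data = scores)
    summary(fit)   # F-statistic and p-value; a small p-value suggests the means differ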

4. TF-IDF method to a given text dataset


The TF-IDF method is applied to a text dataset to evaluate the importance of words in
each document relative to the entire corpus. It calculates the Term Frequency (TF) of
a word in a document and the Inverse Document Frequency (IDF) of the word across
all documents. Words with high TF-IDF scores are considered important for
characterizing the content of a document.
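A minimal base-R sketch of the calculation on a tiny invented corpus (no text-mining package assumed; real projects would typically use a library such as tm or tidytext):

    # Toy corpus: three short documents (hypothetical text)
    docs <- c("big data needs big storage",
              "data analytics extracts insights from data",
              "cloud storage supports analytics")

    tokens <- strsplit(tolower(docs), "\\s+")          # split each document into words

    # Term frequency: counts within a document divided by document length
    tf <- lapply(tokens, function(w) table(w) / length(w))

    # Inverse document frequency: log(N / documents containing the term)
    vocab <- unique(unlist(tokens))
    df    <- sapply(vocab, function(w) sum(sapply(tokens, function(ws) w %in% ws)))
    idf   <- log(length(docs) / df)

    # TF-IDF for document 1: high values mark terms that characterize that document
    tfidf_doc1 <- as.numeric(tf[[1]]) * idf[names(tf[[1]])]
    tfidf_doc1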

5. Emerging Big Data ecosystem influences the design and deployment of IoT-
based data capturing systems in smart cities

The emerging Big Data ecosystem significantly influences the design and deployment
of IoT-based data capturing systems in smart cities. It provides the infrastructure and
tools to handle the massive data volumes generated by IoT devices, enabling
applications like traffic management, energy efficiency, and public safety.

6. Using R, export a cleaned dataset after performing data preparation steps

In R, a cleaned dataset can be exported using functions like write.csv() or write.table(). After performing data preparation steps such as handling missing values, removing duplicates, and transforming variables, these functions allow you to save the resulting dataset to a file for further analysis or use.
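A minimal sketch of this workflow; the file names and the transformation step are hypothetical:

    raw <- read.csv("raw_data.csv", stringsAsFactors = FALSE)   # import the raw file

    clean <- unique(raw)        # drop duplicate rows
    clean <- na.omit(clean)     # drop rows with missing values
    # further transformations (recoding, scaling, etc.) would go here

    write.csv(clean, "clean_data.csv", row.names = FALSE)       # export the cleaned data
    # write.table(clean, "clean_data.txt", sep = "\t", row.names = FALSE)  # tab-delimited alternative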

7. Sampling techniques to prepare a representative subset of a large dataset for Exploratory Data Analysis, and explain its importance.

Sampling techniques are used to select a representative subset of a large dataset for
Exploratory Data Analysis (EDA). This is important because it reduces the
computational complexity and time required for analysis while still providing
meaningful insights into the overall dataset. Common sampling methods include
simple random sampling, stratified sampling, and cluster sampling.
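A short sketch of two of these methods in base R, assuming a large data frame named big_df with a grouping column region (both names are hypothetical):

    set.seed(42)                                   # make the sample reproducible

    # Simple random sampling: 1,000 rows chosen uniformly at random
    srs <- big_df[sample(nrow(big_df), 1000), ]

    # Stratified sampling: roughly 1% of the rows from each region
    strata <- split(seq_len(nrow(big_df)), big_df$region)
    idx    <- unlist(lapply(strata, function(i) sample(i, max(1, round(0.01 * length(i))))))
    strat  <- big_df[idx, ]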

8. Hadoop framework to a large dataset and its distributed storage and processing
capabilities improves the efficiency of data analytics.

Applying the Hadoop framework to a large dataset leverages its distributed storage
(HDFS) and processing (MapReduce) capabilities. This allows for parallel processing
of data across multiple nodes, significantly improving the efficiency of data analytics
by reducing processing time and enabling the analysis of massive datasets.

9. Data from embedded systems, such as audio and spectral data, can be captured
and transmitted through different stages to cloud storage

Data from embedded systems, including audio and spectral data, can be captured and
transmitted through various stages to cloud storage. These stages typically involve
data acquisition by the embedded system, preprocessing, transmission over a network,
and storage in the cloud, enabling further analysis and processing.

10. Steps of the Data Analytics Lifecycle to a real-world case study scenario of your
choice (for example, predicting customer churn)

In a customer churn prediction scenario, the Data Analytics Lifecycle would involve:
o Data Discovery: Gathering customer data from various sources.
o Data Preparation: Cleaning and transforming the data.
o Model Planning: Selecting appropriate prediction models.
o Model Building: Training and testing the chosen model.
o Operationalize: Deploying the model to predict churn.
o Communicate Results: Presenting the findings to stakeholders.
11. Hypothesis testing to validate assumptions in a dataset before proceeding with
advanced analysis.

Hypothesis testing is used to validate assumptions about a dataset before proceeding with advanced analysis. For example, testing whether the data follows a normal distribution or if there is a significant difference between the means of two groups helps ensure that the subsequent analysis is based on sound assumptions.

12. Text classification technique to categorize a collection of documents by topic

Text classification techniques categorize a collection of documents by topic. Algorithms like Naive Bayes, Support Vector Machines, or deep learning models are trained to assign predefined categories to documents based on their content.

13. Data from embedded systems, such as audio and spectral data, can be captured
and transmitted through different stages to cloud storage.

Data from embedded systems, including audio and spectral data, is captured and
transmitted through stages to cloud storage, involving acquisition, preprocessing,
transmission, and storage for analysis.

14. Data analysis using R by importing a dataset, checking its attributes and data
types, and generating descriptive statistics.

Data analysis using R involves importing a dataset using functions like read.csv(),
checking attributes and data types with functions like str() and class(), and
generating descriptive statistics using functions like summary() and describe().
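A minimal sketch of that sequence; the file name is hypothetical, and describe() assumes the psych package is installed (base R's summary() works without it):

    df <- read.csv("sales.csv", stringsAsFactors = FALSE)   # import

    str(df)              # structure: column names, types and example values
    sapply(df, class)    # data type of each column
    dim(df)              # number of rows and columns

    summary(df)          # base-R descriptive statistics per column
    # psych::describe(df)  # richer descriptives (mean, sd, skew, ...) if psych is available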

15. Ranking tests to compare multiple variables and explain how they help in
prioritizing factors affecting the outcome.

Ranking tests compare multiple variables and help prioritize factors affecting the
outcome by assigning ranks based on their relative importance. This allows analysts
to identify the most significant variables influencing the results.

16. Use of the MapReduce paradigm in processing text data, explaining mapper and
reducer work together to analyze text files.

The MapReduce paradigm processes text data by dividing the task into two main
phases: the mapper phase, which processes chunks of the text and emits key-value
pairs, and the reducer phase, which aggregates the results from the mapper to produce
the final output. This parallel processing enables efficient analysis of large text files.
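A conceptual word-count sketch in R that mimics the two phases; this only illustrates the idea, since a real Hadoop job would run the mapper and reducer as distributed tasks (for example via Java or Hadoop Streaming):

    lines <- c("big data needs big tools",
               "map and reduce split the work")            # stand-in for lines of a text file

    # Map phase: emit a (word, 1) pair for every word in every line
    map_out <- unlist(lapply(lines, function(l) {
      words <- strsplit(tolower(l), "\\s+")[[1]]
      setNames(rep(1, length(words)), words)
    }))

    # Shuffle/sort: group the emitted pairs by key (the word)
    grouped <- split(unname(map_out), names(map_out))

    # Reduce phase: sum the counts for each word
    word_counts <- sapply(grouped, sum)
    word_counts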
Long Answer Questions (7 Marks)

1. Transactional and web data in understanding consumer behavior and enhancing business intelligence.

Transactional data (e.g., sales records) and web data (e.g., website clicks) are crucial
for understanding consumer behavior and enhancing business intelligence.
Transactional data reveals purchase patterns and customer history, while web data
provides insights into online behavior and preferences. Analyzing these data types
together enables businesses to gain a comprehensive view of their customers,
personalize marketing efforts, and improve decision-making.

2. Critically analyze the emerging Big Data ecosystem

The emerging Big Data ecosystem is characterized by rapid growth in data volume,
variety, and velocity. It offers powerful tools and technologies for data storage,
processing, and analysis, enabling organizations to extract valuable insights.
However, it also presents challenges such as data governance, security concerns, and
the need for skilled professionals to manage and analyze the data effectively.

3. Role of cloud storage in managing the massive volume of data generated by IoT devices.

Cloud storage plays a vital role in managing the massive volume of data generated by
IoT devices. It provides scalable and cost-effective storage solutions that can
accommodate the continuous influx of data from numerous devices. Cloud storage
also offers accessibility and facilitates data sharing and collaboration, enabling
efficient analysis and utilization of IoT data.

4. Comparison of traditional analytical architectures with modern Big Data architectures, focusing on scalability, flexibility, and data variety handling.

Traditional analytical architectures are often limited in scalability, flexibility, and their ability to handle the variety of data types found in Big Data. Modern Big Data architectures, on the other hand, are designed to be highly scalable, flexible, and capable of processing structured, semi-structured, and unstructured data. They leverage distributed computing and storage to handle large volumes of data and complex analyses.

5. Comparison of different data structures (structured, semi-structured, unstructured) with respect to their suitability in Big Data analytics.

Structured data is well-suited for traditional database systems and analytics but may
struggle with the volume and variety of Big Data. Semi-structured data offers more
flexibility but requires parsing and processing. Unstructured data presents the biggest
challenge due to its lack of a predefined format, necessitating specialized tools and
techniques for analysis. The choice of data structure depends on the specific analytical
needs and the nature of the data.
6. Drivers of Big Data in the current analytical architecture and how they influence
business decision-making.

The main drivers of Big Data in the current analytical architecture are the increasing
volume, velocity, and variety of data. These drivers influence business decision-
making by providing deeper insights, enabling real-time analysis, and supporting
more accurate predictions. Big Data analytics helps organizations to identify trends,
optimize operations, and gain a competitive advantage.

7. Various stages involved in transmitting data from embedded systems to cloud storage, and the potential security concerns at each stage.

The stages involved in transmitting data from embedded systems to cloud storage
include data acquisition, preprocessing, transmission, and storage. Potential security
concerns at each stage include unauthorized access, data breaches, and data
tampering. It is crucial to implement robust security measures such as encryption,
authentication, and access control to protect the data throughout the transmission
process.

8. The integration of audio, video, and spectral data enhances predictive analytics
in modern data ecosystems.

The integration of audio, video, and spectral data enhances predictive analytics in
modern data ecosystems by providing richer and more detailed information. These
data types can reveal patterns and insights that may not be apparent in traditional data,
leading to more accurate predictions in various applications.

9. Embedded systems contribute to Big Data generation, with examples from video,
audio, and spectral data sources.

Embedded systems significantly contribute to Big Data generation. For example, video data from surveillance cameras, audio data from microphones, and spectral data from sensors in industrial equipment all generate large volumes of data that can be used for various analytical purposes.

1. Challenges and strategies involved in the data preparation stage of the analytics
lifecycle.

The data preparation stage of the analytics lifecycle involves several challenges,
including data cleaning, data integration, and data transformation. Strategies to
address these challenges include using data quality tools, establishing data governance
processes, and employing techniques for handling missing values and outliers.
Effective data preparation is crucial for ensuring the accuracy and reliability of
subsequent analyses.

2. Effectively communicating results in Big Data analytics and its impact on business decisions.

Effectively communicating results in Big Data analytics is essential for translating
complex findings into actionable insights. This involves using clear and concise
language, visualizations, and storytelling techniques to convey the key messages to
stakeholders. Effective communication ensures that the analysis informs business
decisions and drives positive outcomes.

3. Comparison of the different stages of the data analytics lifecycle and analyze
how they contribute to the development of successful data models.

The data analytics lifecycle consists of several stages, including data discovery, data
preparation, model planning, model building, operationalize, and communicate
results. Each stage contributes to the development of successful data models by
providing a structured approach to data analysis. From understanding the data to
deploying the model, each phase plays a critical role.

4. Analyze the operationalization phase

The operationalization phase involves deploying the developed model into a production environment. This includes integrating the model with existing systems, automating the analysis process, and monitoring the model's performance. Successful operationalization ensures that the model provides ongoing value to the organization.

5. State of the practice of analytics today and explain how it supports decision-
making across different sectors.

The state of the practice of analytics today is characterized by the increasing use of
advanced techniques such as machine learning and artificial intelligence. Analytics
supports decision-making across different sectors by providing insights into customer
behavior, optimizing operations, and identifying new opportunities.

6. Comparison of the various data types and attributes in R, and analyze their
significance in accurate data modeling.

R supports various data types, including numeric, character, and logical, as well as
data structures like vectors, matrices, and data frames. The appropriate use of these
data types and attributes is significant for accurate data modeling, as it ensures that
the data is represented and processed correctly.

7. The model planning phase shapes the selection of analytical techniques and tools
in a Big Data project.

The model planning phase shapes the selection of analytical techniques and tools in a
Big Data project. This phase involves defining the objectives of the analysis,
identifying the relevant data sources, and choosing the appropriate methods for
modeling and analysis. Careful planning ensures that the project is aligned with the
business goals and that the chosen techniques are suitable for the data.

8. The role of initial data analysis using R in detecting data quality issues before
advanced modeling.
Initial data analysis using R plays a crucial role in detecting data quality issues before
advanced modeling. By exploring the data, analysts can identify missing values,
outliers, and inconsistencies that could affect the accuracy of the models. R provides
functions for data cleaning and validation, enabling analysts to address these issues
early in the process.

9. Role of Big Data Analytics in industry verticals such as healthcare, finance, and
retail, highlighting specific use cases.

Big Data Analytics plays a significant role in industry verticals such as healthcare,
finance, and retail. In healthcare, it can be used for patient diagnosis and treatment
optimization. In finance, it can aid in fraud detection and risk management. In retail, it
can enhance customer experience and supply chain management.

10. Process of model building in the analytics lifecycle and discuss how iterative
testing improves model accuracy.

The process of model building in the analytics lifecycle involves selecting an appropriate algorithm, training the model on the data, and evaluating its performance. Iterative testing, where the model is refined based on feedback, improves model accuracy by identifying and correcting errors or biases.

11. Application of R as a tool for initial data analysis, focusing on its capabilities for
data import, export, and handling various data types.

R is a powerful tool for initial data analysis, offering capabilities for data import,
export, and handling various data types. It allows analysts to read data from different
sources, manipulate and transform the data, and export the results for further use. R's
flexibility and extensive libraries make it suitable for a wide range of analytical tasks.

12. R's graphical user interface simplifies the data analysis process for beginners
and advanced users alike.

R's graphical user interface (GUI) simplifies the data analysis process for both
beginners and advanced users. It provides an easier way to interact with the software,
making it more accessible and user-friendly. The GUI can help users perform tasks
more efficiently and reduce the learning curve associated with R.

13. The importance of the discovery phase in the analytics lifecycle and how it
influences the outcome of an analytics project.

The discovery phase is crucial in the analytics lifecycle as it sets the foundation for
the entire project. This phase involves understanding the business problem,
identifying data sources, and exploring the data to gain initial insights. The outcome
of this phase significantly influences the subsequent stages and the overall success of
the project.

14. A case study of Big Data analytics (e.g., fraud detection in banking) by applying
the stages of the data analytics lifecycle.
In a fraud detection case study in banking, the data analytics lifecycle would involve:

o Data Discovery: Gathering transaction data and customer information.
o Data Preparation: Cleaning and transforming the data.
o Model Planning: Selecting fraud detection algorithms.
o Model Building: Training and testing the model.
o Operationalize: Deploying the model to detect fraud in real-time.
o Communicate Results: Reporting fraud trends to stakeholders.
15. Descriptive statistics in R help in understanding data distributions and
identifying initial patterns.

Descriptive statistics in R are used to summarize and understand data distributions and identify initial patterns. Measures such as mean, median, mode, and standard deviation provide insights into the central tendency and variability of the data.

16. R's graphical user interface simplifies the data analysis process for beginners
and advanced users alike.

R's graphical user interface simplifies the data analysis process for both beginners and
advanced users, making the software more accessible and user-friendly.

CO3 (7 Marks)

1. Visualization techniques to examine the relationship between multiple variables in a dataset.

Visualization techniques such as scatter plots, heatmaps, and parallel coordinate plots
can be used to examine the relationship between multiple variables in a dataset. These
techniques help to identify correlations, patterns, and trends, providing insights into
how variables interact with each other.
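A short base-R sketch using the built-in mtcars dataset to illustrate these techniques:

    data(mtcars)
    vars <- mtcars[, c("mpg", "hp", "wt")]

    pairs(vars)                        # scatter-plot matrix: pairwise relationships
    round(cor(vars), 2)                # correlation matrix as a numeric summary
    heatmap(cor(mtcars), symm = TRUE)  # heatmap of all pairwise correlations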

2. Test for the steps to conduct a ranking test on survey data to identify customer
preferences for a new product.

Steps to conduct a ranking test on survey data to identify customer preferences for a
new product include:

o Collecting survey data where customers rank product features.
o Assigning numerical values to the ranks.
o Calculating the average rank for each feature.
o Ranking the features based on their average rank.
o Interpreting the results to determine customer preferences.
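A minimal sketch of these steps in R, using a tiny invented survey where each row is a respondent and each column holds the rank given to a feature (1 = most preferred):

    survey <- data.frame(
      price   = c(1, 2, 1, 3),
      design  = c(2, 1, 3, 1),
      battery = c(3, 3, 2, 2)
    )

    avg_rank <- colMeans(survey)       # average rank per feature
    sort(avg_rank)                     # lowest average rank = most preferred feature

    # friedman.test(as.matrix(survey)) # optional: test whether the rank differences are significant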
3. Theme to detect and handle outliers during exploratory data analysis

Outliers can be detected and handled during exploratory data analysis using
techniques such as visualization (e.g., box plots, scatter plots), statistical methods
(e.g., z-score, IQR), and domain knowledge. Handling outliers may involve removing
them, transforming them, or imputing them, depending on the reason for their
presence and their potential impact on the analysis.
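A minimal sketch of the IQR rule and a box-plot check in base R, on invented values:

    x <- c(12, 15, 14, 13, 102, 16, 11, 14)   # 102 looks suspicious

    q     <- quantile(x, c(0.25, 0.75))
    iqr   <- q[2] - q[1]
    lower <- q[1] - 1.5 * iqr
    upper <- q[2] + 1.5 * iqr

    x[x < lower | x > upper]   # values flagged as outliers by the IQR rule
    boxplot(x)                 # the same points appear beyond the whiskers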
4. Categorize the sampling methods to reduce the computational complexity of
exploratory data analysis for a large dataset.

Sampling methods to reduce the computational complexity of exploratory data analysis for a large dataset can be categorized into:

o Probability sampling: Simple random sampling, stratified sampling, cluster sampling.
o Non-probability sampling: Convenience sampling, purposive sampling.

Probability sampling methods ensure that each element has a known probability of
being included in the sample, while non-probability methods are often used for
exploratory purposes when representativeness is less critical.

5. Steps to visualize a single variable in a dataset and interpret what patterns or anomalies can be detected.

Steps to visualize a single variable in a dataset include:

o Choosing an appropriate visualization (e.g., histogram, bar chart, box plot).
o Plotting the data.
o Interpreting the visualization to identify the distribution, central tendency, variability, and any patterns or anomalies such as outliers or skewness.
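These steps in base R, on a simulated numeric variable (the data are generated only for illustration):

    set.seed(1)
    x <- rnorm(500, mean = 50, sd = 10)        # hypothetical measurements

    hist(x, breaks = 30, main = "Distribution of x", xlab = "x")  # shape, skewness, gaps
    boxplot(x, main = "Box plot of x")                            # spread and outliers
    summary(x)                                                    # numeric confirmation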
6. Visualization techniques to detect patterns and trends in time-series data during initial data analysis.

Visualization techniques to detect patterns and trends in time-series data include line
charts, area charts, and decomposition plots. These techniques help to identify
seasonality, trends, and cyclical patterns, as well as anomalies or outliers in the data
over time.

7. Difference of means test (e.g., t-test) to compare two groups in a dataset and
interpret the result.

A difference of means test, such as the t-test, is used to compare the means of two
groups in a dataset. The result of the test indicates whether the difference between the
means is statistically significant, meaning it is unlikely to have occurred by chance.
The interpretation involves considering the p-value and the confidence interval to
assess the strength and reliability of the finding.
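A minimal sketch of a two-sample t-test in R on invented measurements for two groups:

    group_a <- c(23, 25, 28, 22, 26, 27)
    group_b <- c(30, 31, 29, 33, 28, 32)

    result <- t.test(group_a, group_b)   # Welch two-sample t-test by default
    result$p.value                       # p < 0.05: the difference in means is significant
    result$conf.int                      # confidence interval for the difference of means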

8. Hypothesis testing can prevent erroneous conclusions in exploratory data analysis with an example.

Hypothesis testing can prevent erroneous conclusions in exploratory data analysis by providing a framework for validating assumptions and determining the statistical significance of observed patterns. For example, testing the hypothesis that two samples have the same mean can prevent concluding that there is a difference when the observed difference is likely due to random variation.
9. Steps to handle dirty data in an exploratory data analysis and explain the impact
of unclean data on analysis outcomes.

Steps to handle dirty data in exploratory data analysis include:

o Identifying dirty data (e.g., missing values, duplicates, inconsistencies).
o Cleaning the data (e.g., imputing missing values, removing duplicates, correcting errors).
o Validating the cleaned data.

Unclean data can lead to inaccurate results, biased conclusions, and flawed decision-
making.
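A minimal sketch of these steps in R on a small invented data frame containing a duplicate row, a missing value, and an implausible value:

    df <- data.frame(id  = c(1, 2, 2, 3, 4),
                     age = c(25, 31, 31, NA, 200))

    sum(is.na(df$age))                            # identify missing values
    df <- df[!duplicated(df), ]                   # remove the duplicated row
    df$age[!is.na(df$age) & df$age > 120] <- NA   # treat impossible ages as missing
    df$age[is.na(df$age)] <- median(df$age, na.rm = TRUE)  # simple imputation
    df                                            # validate the cleaned result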

10. Comparison of data exploration and data presentation, and visual tools serve
both purposes with examples.

Data exploration involves investigating data to understand its characteristics and identify potential patterns, while data presentation focuses on communicating the findings of the analysis in a clear and concise manner. Visual tools like scatter plots and histograms can serve both purposes, allowing for exploration during analysis and effective presentation of results.

11. Process of applying sampling techniques in EDA to ensure representativeness of the data.

The process of applying sampling techniques in EDA to ensure representativeness of the data involves:

o Defining the population of interest.
o Choosing an appropriate sampling method.
o Determining the sample size.
o Selecting the sample.
o Verifying the representativeness of the sample.
12. ANOVA (Analysis of Variance) to compare multiple groups in a dataset and
interpret the findings.

ANOVA is used to compare the means of multiple groups in a dataset. The findings
are interpreted by examining the F-statistic and the associated p-value to determine if
there is a statistically significant difference between the group means. If the result is
significant, post-hoc tests may be used to identify which specific groups differ from
each other.
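A minimal sketch in base R with invented values for three groups, including a Tukey post-hoc comparison:

    df <- data.frame(
      value = c(5, 6, 7, 9, 10, 11, 14, 15, 16),
      group = factor(rep(c("G1", "G2", "G3"), each = 3))
    )

    fit <- aov(value ~ group, data = df)
    summary(fit)    # F-statistic and p-value for the overall group effect
    TukeyHSD(fit)   # pairwise comparisons showing which groups differ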

13. Visualization techniques to demonstrate how to explore a dataset before formal analysis, and insights they can provide.

Visualization techniques to explore a dataset before formal analysis include histograms, scatter plots, box plots, and correlation matrices. These techniques can provide insights into data distribution, relationships between variables, outliers, and potential issues with data quality.
14. Ranking tests can be applied in exploratory data analysis to identify top-
performing categories in a dataset.

Ranking tests can be applied in exploratory data analysis to identify top-performing categories in a dataset by assigning ranks based on a specific metric, such as sales or customer satisfaction. This allows for easy comparison and prioritization of categories.

15. Application of hypothesis testing in EDA to validate initial assumptions about the data.

Hypothesis testing can be applied in EDA to validate initial assumptions about the
data, such as testing whether the data follows a particular distribution or whether there
is a significant difference between groups. This helps to ensure that subsequent
analyses are based on sound assumptions.

16. Steps to conduct a ranking test on survey data to identify customer preferences
for a new product.

Steps to conduct a ranking test on survey data to identify customer preferences for a
new product:

o Collect survey data with ranked preferences.
o Assign numerical ranks.
o Calculate the average rank for each product.
o Rank the products based on average ranks.
o Interpret the results.

CO4 (7 Marks)

1. Process of sentiment analysis in text analytics and explain its significance in understanding customer feedback and social media content.

Sentiment analysis in text analytics involves determining the emotional tone expressed in text, whether it is positive, negative, or neutral. This process is significant in understanding customer feedback and social media content as it provides insights into customer opinions, brand perception, and public sentiment towards products or services.

2. Use of Hadoop and MapReduce can improve the efficiency of text analysis when
processing large volumes of unstructured data.

The use of Hadoop and MapReduce can improve the efficiency of text analysis when
processing large volumes of unstructured data. Hadoop provides distributed storage
and processing capabilities, while MapReduce enables parallel processing of text data,
significantly reducing the time required for analysis.

3. Role of TF-IDF (Term Frequency-Inverse Document Frequency) in text mining


The role of TF-IDF in text mining is to evaluate the importance of a word in a
document relative to a collection of documents. It helps to identify words that are
characteristic of a document while downweighting words that are common across all
documents.
