
Assignment 3 Based On Unit 3

This document discusses various concepts related to time series analysis and predictive modeling. It covers decision trees and how they can be used to predict customer purchasing behavior. It also discusses Naive Bayes classification and how it uses Bayes' theorem for probabilistic classification. Finally, it discusses time series analysis, including its components of trend, seasonality, and cyclical patterns. It also outlines the Box-Jenkins methodology for time series analysis and modeling.

Assignment 3 based on Unit 3

Unit III
1. How can we predict whether customers will buy a product or not? Explain with respect to
decision trees.
i) A Decision Tree is a tree-like graph with nodes representing the place where we pick an
attribute and ask a question; edges represent the answers to the question, and the leaves
represent the actual output or class label.
ii) Figure 7-1 shows an example of using a decision tree to predict whether customers will buy a
product.

iii) The term branch refers to the outcome of a decision and is visualized as a line connecting
two nodes.
iv) If a decision is numerical, the "greater than" branch is usually placed on the right, and the
"less than" branch is placed on the left.
v) Depending on the nature of the variable, one of the branches may need to include an "equal
to" component.
vi) Internal nodes are the decision or test points. Each internal node refers to an input variable
or an attribute.
vii) The top internal node is called the root. The decision tree in Figure 7-1 is a binary tree in
that each internal node has no more than two branches.
viii) The depth of a node is the minimum number of steps required to reach the node from the
root. In Figure 7-1 for example, nodes Income and Age have a depth of one, and the four
nodes on the bottom of the tree have a depth of two.
ix) Leaf nodes are at the end of the last branches on the tree. They represent class labels—the
outcome of all the prior decisions.
x) The path from the root to a leaf node contains a series of decisions made at various internal
nodes.
xi) The decision tree in Figure 7-1 shows that females with income less than or equal
to $45,000 and males 40 years old or younger are classified as people who would purchase the
product.
xii) In traversing this tree, age does not matter for females, and income does not matter for
males.
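The traversal described in points xi) and xii) can be written as a few nested conditions. This is only a sketch: the function name `will_purchase` is hypothetical, and the root split on gender is an assumption inferred from the text (Income and Age sit at depth one), since the figure itself is not reproduced here.

```python
def will_purchase(gender: str, income: float, age: int) -> bool:
    """Traverse the two-level tree described for Figure 7-1."""
    # Root node (assumed): split on gender.
    if gender == "female":
        # Depth-one node for females: Income. Age is never consulted.
        return income <= 45_000
    if gender == "male":
        # Depth-one node for males: Age. Income is never consulted.
        return age <= 40
    raise ValueError(f"unexpected gender value: {gender!r}")


# Each call follows one root-to-leaf path; the leaf is the class label.
print(will_purchase("female", 30_000, 65))  # True
print(will_purchase("male", 90_000, 52))    # False
```

Each `return` corresponds to a leaf node, and each `if` to an internal node, so the depth of the nesting matches the depth of the tree.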

2. Explain a probabilistic classification method based on Naive Bayes' theorem.

Hiren Parkar 22306A1031


i) Naive Bayes is a probabilistic classification method based on Bayes' theorem. Bayes'
theorem gives the relationship between the probabilities of two events and their conditional
probabilities.
ii) A naive Bayes classifier assumes that the presence or absence of a particular feature of a
class is unrelated to the presence or absence of other features. For example, an object can
be classified based on its attributes such as shape, colour, and weight.
iii) The input variables are generally categorical, but variations of the algorithm can accept
continuous variables. There are also ways to convert continuous variables into categorical
ones. This process is often referred to as the discretization of continuous variables.
iv) An attribute such as income can be converted into categorical values as shown below.
Each boundary value is assigned to the higher band so that every income falls into exactly
one category.
• Low Income: income < $10,000
• Working Class: $10,000 ≤ income < $50,000
• Middle Class: $50,000 ≤ income < $1,000,000
• Upper Class: income ≥ $1,000,000
v) The output typically includes a class label and its corresponding probability score. The
probability score is not the true probability of the class label, but it's proportional to the true
probability.
vi) Applications
a) Spam filtering is a classic use case of naive Bayes text classification. Bayesian spam
filtering has become a popular mechanism to distinguish spam e-mail from legitimate
e-mail.
b) Naive Bayes classifiers can also be used for fraud detection. In the domain of auto
insurance, for example, based on a training set with attributes such as driver's rating,
vehicle age, vehicle price, historical claims by the policy holder, police report status, and
claim genuineness, naive Bayes can provide probability-based classification of whether
a new claim is genuine.

vii) The conditional probability of event C occurring, given that event A has already occurred,
is denoted as P(C|A), which can be found using the formula in Equation 5-6.

P(C|A) = P(A ∩ C) / P(A)    (5-6)

Equation 5-7 can be obtained with some minor algebra and substitution of the conditional
probability P(A|C) = P(A ∩ C) / P(C).

P(C|A) = P(A|C) · P(C) / P(A)    (5-7)

where C is the class label and A is the set of observed attributes.

Equation 5-7 is the most common form of Bayes' theorem.
viii) Mathematically, Bayes' theorem gives the relationship between the probabilities of C
and A, P(C) and P(A), and the conditional probabilities of C given A and of A given C, namely
P(C|A) and P(A|C).
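The pieces of this answer (discretization, the independence assumption, and a score proportional to the true probability) can be combined in a small sketch. Everything here is illustrative: the function names, the second attribute ("owns a car"), and the six training rows are made up, boundary incomes are assumed to belong to the higher band, and no smoothing is applied, so an unseen attribute value gives a score of zero.

```python
from bisect import bisect_right
from collections import Counter, defaultdict

BOUNDS = [10_000, 50_000, 1_000_000]
LABELS = ["Low Income", "Working Class", "Middle Class", "Upper Class"]


def income_band(income):
    """Discretize income; a boundary value is assumed to fall in the higher band."""
    return LABELS[bisect_right(BOUNDS, income)]


def train(rows):
    """rows: list of (attribute_tuple, class_label). Returns the count tables."""
    class_counts = Counter(label for _, label in rows)
    # value_counts[(attribute_index, class_label)][value] = number of occurrences
    value_counts = defaultdict(Counter)
    for attrs, label in rows:
        for j, value in enumerate(attrs):
            value_counts[(j, label)][value] += 1
    return class_counts, value_counts


def scores(attrs, class_counts, value_counts):
    """Return P(c) * product of P(a_j | c) for each class c.

    P(A) is left out, so each score is proportional to P(c | A), not equal to it.
    """
    total = sum(class_counts.values())
    result = {}
    for label, n in class_counts.items():
        score = n / total  # prior P(c)
        for j, value in enumerate(attrs):
            score *= value_counts[(j, label)][value] / n  # P(a_j | c)
        result[label] = score
    return result


# Made-up training data: (income band, owns a car) -> class label.
rows = [
    ((income_band(30_000), "yes"), "buy"),
    ((income_band(45_000), "yes"), "buy"),
    ((income_band(60_000), "no"), "buy"),
    ((income_band(8_000), "no"), "skip"),
    ((income_band(30_000), "no"), "skip"),
    ((income_band(9_000), "yes"), "skip"),
]
class_counts, value_counts = train(rows)

s = scores((income_band(40_000), "yes"), class_counts, value_counts)
total = sum(s.values())
# Dividing by the total recovers normalized probabilities for this toy data.
print({label: round(v / total, 3) for label, v in s.items()})
```

The multiplication inside `scores` is where the naive assumption enters: each attribute contributes its own factor P(a_j | c) independently of the others.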

3. How can the structure of observations taken over time be modeled? Explain with respect to
time series analysis. Also explain any two of its applications.



i) Time series analysis attempts to model the underlying structure of observations taken
over time. A time series, denoted y_t for t = 1, 2, ..., n, is an ordered sequence of n equally
spaced values over time.
ii) For example, Figure 6-1 provides a plot of the monthly number of international airline
passengers over a 12-year period. In this example, the time series consists of an ordered
sequence of 144 values.

iii) Following are the goals of time series analysis:


• Identify and model the structure of the time series.
• Forecast future values in the time series.
iv) Time series analysis has many applications in finance, economics, biology, engineering,
retail, and manufacturing.
1) Retail sales: For various product lines, a clothing retailer is looking to forecast future
monthly sales. These forecasts need to account for the seasonal aspects of the
customers' purchasing decisions.
2) Stock trading: Some high-frequency stock traders utilize a technique called pairs trading.
In pairs trading, an identified strong positive correlation between the prices of two
stocks is used to detect a market opportunity. Suppose the stock prices of Company A
and Company B consistently move together. Time series analysis can be applied to the
difference of these companies' stock prices over time. A statistically larger-than-expected
price difference indicates that it is a good time to buy the stock of Company A and sell
the stock of Company B, or vice versa.
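The pairs trading idea can be illustrated with a z-score on the price difference. This is a simplified sketch, not a trading method: the prices are made up, the threshold of 2 standard deviations is an arbitrary assumption, and the direction of the trade assumes the difference will return to its usual level.

```python
from statistics import mean, stdev


def spread_zscores(prices_a, prices_b):
    """Standardize the price difference A - B over the whole sample."""
    spread = [a - b for a, b in zip(prices_a, prices_b)]
    mu, sigma = mean(spread), stdev(spread)
    return [(s - mu) / sigma for s in spread]


def signal(z, threshold=2.0):
    """Flag a difference that is unusually large in either direction."""
    if z > threshold:
        return "sell A, buy B"   # A looks expensive relative to B
    if z < -threshold:
        return "buy A, sell B"   # A looks cheap relative to B
    return "hold"


# Made-up daily prices: the two stocks move together until the last day.
prices_a = [50, 51, 52, 51, 50, 52, 53, 52, 51, 58]
prices_b = [48, 49, 50, 49, 48, 50, 51, 50, 49, 50]

z = spread_zscores(prices_a, prices_b)
print([signal(v) for v in z])
```

Only the final day, where the difference jumps from 2 to 8, produces a signal; every other day is classified as "hold".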

4. What are the components of time series? Explain each of them. Also write the main
steps of Box-Jenkins methodology for time series analysis.
A time series can consist of the following components:



• Trend (a long period of time)
• Seasonality (within a year)
• Cyclic (a span of more than one year)
• Random
1) Trend -
i) The trend refers to the long-term, relatively smooth pattern that persists over a number of
years in a time series.
ii) It indicates whether the observation values are increasing or decreasing over time.
iii) Examples of trends are a steady month-over-month increase in sales, the number of airline
passengers, the population, agricultural production, items manufactured, the number of births
and deaths, the number of industries or factories, the number of schools or colleges, etc.
2) Seasonality -
i) The seasonality component describes a pattern that repeats at a regular interval, where the
interval is one year or shorter.
ii) This variation will be present in a time series if the data are recorded hourly, daily,
weekly, quarterly, or monthly.
iii) For example, monthly retail sales can fluctuate over the year due to the weather and
holidays.
3) Cyclic -
i) A cyclic component also refers to a periodic fluctuation, but one whose period is longer than
one year.
ii) For example, retail sales are influenced by the general state of the economy. Thus, a
retail sales time series can often follow the lengthy boom-bust cycles of the economy.
4) Random-
Although noise is certainly part of this random component, there is often some underlying
structure to this random component that needs to be modelled to forecast future values of
a given time series.
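To make the components concrete, the sketch below builds a synthetic monthly series as trend + seasonality + random and then recovers the trend with a centered 12-month moving average. The numbers are invented, the cyclic component is left out to keep the example short, and the moving average is one common smoothing choice assumed here, not something prescribed above.

```python
import math
import random

random.seed(0)
n_months = 48

# Made-up components of a monthly series.
trend = [100 + 2 * t for t in range(n_months)]                      # steady increase
seasonal = [10 * math.sin(2 * math.pi * t / 12) for t in range(n_months)]  # repeats yearly
noise = [random.gauss(0, 1) for _ in range(n_months)]               # random component
series = [tr + s + e for tr, s, e in zip(trend, seasonal, noise)]


def centered_ma12(y):
    """Centered 12-month moving average (half weight on the two end months).

    Averaging over a full year cancels a pattern that repeats every 12 months,
    leaving an estimate of the trend. Defined for indices 6 .. len(y) - 7.
    """
    out = {}
    for t in range(6, len(y) - 6):
        window = y[t - 6 : t + 7]  # 13 consecutive months
        out[t] = (0.5 * window[0] + sum(window[1:-1]) + 0.5 * window[-1]) / 12
    return out


estimated_trend = centered_ma12(series)
for t in (6, 24, 41):
    print(t, trend[t], round(estimated_trend[t], 2))
```

Subtracting the estimated trend from the series would leave the seasonal and random parts, which could then be separated by averaging the values for each calendar month.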

The Box-Jenkins methodology for time series analysis involves the following three main
steps:
1) Condition data and select a model.
a. Identify and account for any trends or seasonality in the time series.
b. Examine the remaining time series and determine a suitable model.
2) Estimate the model parameters.
3) Assess the model and return to Step 1, if necessary.
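Step 1a is often carried out by differencing, which is assumed here as the way to "account for" trend and seasonality. The sketch uses a made-up quarterly series with a linear trend and a fixed seasonal pattern; the helper name `difference` is hypothetical.

```python
def difference(y, lag=1):
    """Return y[t] - y[t - lag] for every t where both values exist."""
    return [y[t] - y[t - lag] for t in range(lag, len(y))]


# Made-up quarterly series: linear trend plus a pattern that repeats every 4 quarters.
season = [4, -1, -5, 2]
y = [5 + 3 * t + season[t % 4] for t in range(16)]

# Differencing at the seasonal lag removes the repeating pattern;
# what remains is the trend's growth over one year (3 per quarter * 4 quarters).
deseasonalized = difference(y, lag=4)

# Differencing once more at lag 1 removes what is left of the trend.
stationary = difference(deseasonalized, lag=1)

print(deseasonalized)
print(stationary)
```

In a real series the result would not be exactly constant; the remaining values are what step 1b examines when choosing a model.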

5. Explain Autoregressive Integrated Moving Average Model in detail.

6. What are major challenges with text analysis? Explain with examples.
i) Text analysis suffers from the curse of high dimensionality: every distinct word in the
text counts as one dimension.
e.g. A book that contains 50 distinct words is said to have dimension 50.
The smallest corpus (a collection of written texts) in the list, the complete works of
Shakespeare, contains about 0.88 million words.



ii) In contrast, the Google n-gram corpus contains one trillion words from publicly accessible
web pages.
Out of the one trillion words in the Google n-gram corpus, there might be one million distinct
words, which would correspond to one million dimensions.
iii) The high dimensionality of text is an important issue, and it has a direct impact on the
complexities of many text analysis tasks.
iv) Another major challenge with text analysis is that most of the time the text is not
structured.
v) As we know, data may be semi-structured (e.g. XML), quasi-structured (data with irregular
formats that can be formatted with effort, tools, and time), or unstructured.
vi) Table 9-2 on the next slide shows some example data sources and data formats that text
analysis may have to deal with.
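The dimensionality point in i) to iii) can be seen on a tiny made-up corpus: each distinct word becomes one position in a document's count vector. The tokenizer below (lowercase letters and apostrophes only) is a simplifying assumption.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep runs of letters (and apostrophes) as words."""
    return re.findall(r"[a-z']+", text.lower())


# Made-up two-document corpus.
docs = ["The cat sat on the mat", "The dog sat"]

# One dimension per distinct word across the whole corpus.
vocab = sorted({word for doc in docs for word in tokenize(doc)})


def to_vector(text):
    """Represent a document as word counts, one entry per vocabulary word."""
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocab]


print(len(vocab), vocab)
for doc in docs:
    print(to_vector(doc))
```

Nine words of text already need six dimensions here, and most entries of the second vector are zero; with a million distinct words the vectors become both very long and very sparse.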



7. What are various text analysis steps? Explain in detail.
A text analysis problem usually consists of three important steps:
• Parsing –
i) Parsing is the process that takes unstructured text and imposes a structure for further
analysis.
ii) The unstructured text could be a plain text file, a weblog, an Extensible Markup Language
(XML) file, a HyperText Markup Language (HTML) file, or a Word document.
iii) Parsing deconstructs the provided text and renders it in a more structured way for the
subsequent steps.

• Search and Retrieval –


i) Search and retrieval is the identification of the documents in a corpus that contain search
items such as specific words, phrases, topics, or entities like people or organizations.
ii) These search items are generally called key terms. Search and retrieval originated from
the field of library science and is now used extensively by web search engines.

• Text Mining -
i) Text mining discovers meaningful insights pertaining to domains or problems of
interest.
ii) With the proper representation of the text, many techniques, such as clustering and
classification, can be adapted to text mining.
iii) For example, k-means can be modified to cluster text documents into groups, where
each group represents a collection of documents with a similar topic. The distance of a
document to a centroid represents how closely the document talks about that topic.
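The first two steps above, parsing and then search and retrieval, can be sketched with the standard library. The three documents are made up, the class and function names are hypothetical, and the parser simply keeps all text between tags, which is a simplification of what a real parser would do.

```python
import re
from collections import defaultdict
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the text between tags, discarding the markup itself."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def parse(raw):
    """Parsing: turn raw HTML or plain text into a list of lowercase terms."""
    extractor = TextExtractor()
    extractor.feed(raw)
    extractor.close()
    return re.findall(r"[a-z0-9]+", " ".join(extractor.parts).lower())


def build_index(corpus):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, raw in corpus.items():
        for term in parse(raw):
            index[term].add(doc_id)
    return index


def search(index, *key_terms):
    """Search and retrieval: documents that contain every key term."""
    sets = [index.get(term.lower(), set()) for term in key_terms]
    return sorted(set.intersection(*sets)) if sets else []


# Made-up corpus mixing HTML and plain text.
corpus = {
    "d1": "<p>Sales of <b>phones</b> rose</p>",
    "d2": "<p>Phones and tablets</p>",
    "d3": "plain text about tablets",
}
index = build_index(corpus)
print(search(index, "phones"))
print(search(index, "phones", "tablets"))
```

The term lists produced by `parse` are also the kind of structured representation that the text mining step would convert into vectors before clustering or classification.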

8. How can information be retrieved by applying text analysis? Explain with respect to Term
Frequency.
