Unit 4 Notes
UNIT 4
NOTES MATERIAL
OBJECT SEGMENTATION
TIME SERIES METHODS
Topics:
Object Segmentation:
Supervised and Unsupervised Learning
Segmentation & Regression Vs Segmentation
Regression, Classification, Overfitting,
Decision Tree Building
Pruning and Complexity
Multiple Decision Trees etc.
Steps:
Define purpose – Already mentioned in the statement above
Identify critical parameters – Some of the variables that come to mind
are skill, motivation, vintage, department, education, etc. Let us say that,
based on past experience, we know that skill and motivation are the most
important parameters. Also, for the sake of simplicity, we select just these 2
variables. Taking additional variables will increase the complexity, but
can be done if it adds value.
Granularity – Let us say we are able to classify both skill and motivation
into High and Low using various techniques.
There are two broad set of methodologies for segmentation:
Objective (supervised) segmentation
Non-Objective (unsupervised) segmentation
Objective Segmentation
Segmentation to identify the type of customers who would respond to
a particular offer.
Segmentation to identify high spenders among customers who will
use the e-commerce channel for festive shopping.
Segmentation to identify customers who will default on their credit
obligation for a loan or credit card.
Non-Objective Segmentation
https://www.yieldify.com/blog/types-of-market-segmentation/
Segmentation of the customer base to understand the specific profiles
which exist within the customer base so that multiple marketing
actions can be personalized for each segment
Segmentation of geographies on the basis of affluence and lifestyle of
people living in each geography so that sales and distribution strategies
can be formulated accordingly.
Hence, it is critical that the segments created using an objective
segmentation methodology differ with respect to the stated objective
(e.g. response to an offer).
However, in case of a non-objective methodology, the segments are
different with respect to the “generic profile” of observations belonging
to each segment, but not with regards to any specific outcome of
interest.
The most common techniques for building non-objective segmentation
are cluster analysis, K nearest neighbor techniques etc.
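As an illustration, here is a minimal sketch of non-objective segmentation via cluster analysis, assuming scikit-learn and NumPy are available; the customer features (annual spend, visits per year) and the choice of 4 segments are made up for the example.

# Non-objective segmentation with k-means (illustrative sketch).
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Hypothetical customer base: annual spend and number of visits per year.
customers = np.column_stack([
    rng.normal(50_000, 15_000, 300),   # annual spend
    rng.poisson(12, 300),              # visits per year
])

# Standardize so both features contribute comparably to the distance metric.
X = StandardScaler().fit_transform(customers)

# Segment into 4 clusters; the number of segments is a modelling choice.
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)
print("Segment sizes:", np.bincount(kmeans.labels_))

Note that segments found this way differ in their generic profile, not with respect to any specific outcome of interest.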
Regression Vs Segmentation
Regression analysis focuses on finding a relationship between a
dependent variable and one or more independent variables.
Predicts the value of a dependent variable based on the value of
at least one independent variable.
Explains the impact of changes in an independent variable on the
dependent variable.
We use linear or logistic regression techniques to develop accurate
models for predicting an outcome of interest.
Often, we create separate models for separate segments.
Segmentation methods such as CHAID or CRT are used to judge their
effectiveness.
Creating separate models for separate segments may be time-consuming
and not worth the effort, but it may also provide higher predictive power.
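A minimal sketch of the "separate model per segment" idea, assuming pandas and scikit-learn; the segment column, feature names, and target name are hypothetical.

# Fit one logistic regression per segment instead of a single pooled model.
import pandas as pd
from sklearn.linear_model import LogisticRegression

def fit_models_per_segment(df, features, target="response", segment="segment"):
    """Return a dict mapping each segment label to its own fitted model."""
    models = {}
    for seg, part in df.groupby(segment):
        model = LogisticRegression(max_iter=1000)
        model.fit(part[features], part[target])
        models[seg] = model
    return models

# Usage (with a hypothetical DataFrame `data`):
# models = fit_models_per_segment(data, features=["age", "income"])
# prob = models["high_value"].predict_proba(new_rows[["age", "income"]])[:, 1]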
Decision Tree is a supervised learning technique that can be used for both
classification and Regression problems, but mostly it is preferred for
solving Classification problems.
Decision Trees usually mimic human thinking ability while making a
decision, so it is easy to understand.
A decision tree simply asks a question and, based on the answer
(Yes/No), further splits the tree into subtrees.
It is a graphical representation for getting all the possible solutions to a
problem/decision based on given conditions.
It is a tree-structured classifier, where internal nodes represent the
features of a dataset, branches represent the decision rules and each leaf
node represents the outcome.
In a Decision tree, there are two nodes, which are the Decision Node and
Leaf Node. Decision nodes are used to make any decision and have
multiple branches, whereas Leaf nodes are the output of those decisions
and do not contain any further branches.
Basic Decision Tree Learning Algorithm:
Now that we know what a Decision Tree is, we’ll see how it works
internally. There are many algorithms that construct Decision Trees,
but one of the best known is the ID3 algorithm. ID3
stands for Iterative Dichotomiser 3.
Parent/Child node: The root node of the tree is called the parent node, and other
nodes are called the child nodes.
Decision Tree Representation:
Each non-leaf node is connected to a test that splits its set of
possible answers into subsets corresponding to different test results.
Each branch carries a particular test result's subset to another node.
Each node is connected to a set of possible answers.
Below diagram explains the general structure of a decision tree:
Decision trees use multiple algorithms to decide to split a node into two or more
sub-nodes. The creation of sub-nodes increases the homogeneity of resultant
sub-nodes. In other words, we can say that the purity of the node increases
with respect to the target variable. The decision tree splits the nodes on all
available variables and then selects the split which results in most
homogeneous sub-nodes.
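As a quick illustration of purity-driven splitting, the sketch below fits a small tree with scikit-learn and prints the learned rules; the Iris data and the entropy criterion are assumptions made only for the example.

# Decision tree classifier: internal nodes test features, leaves hold classes.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
clf = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=0)
clf.fit(iris.data, iris.target)

# Each printed split is the test that produced the most homogeneous sub-nodes.
print(export_text(clf, feature_names=list(iris.feature_names)))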
Tree Building: Decision tree learning is the construction of a decision tree from
class-labeled training tuples. A decision tree is a flow-chart-like structure,
where each internal (non-leaf) node denotes a test on an attribute, each
branch represents the outcome of a test, and each leaf (or terminal) node
holds a class label. The topmost node in a tree is the root node. There are
many specific decision-tree algorithms. Notable ones include the following.
ID3 → (extension of D3)
C4.5 → (successor of ID3)
CART → (Classification And Regression Tree)
CHAID → (Chi-square Automatic Interaction Detection): performs multi-level splits
when computing classification trees
MARS → (Multivariate Adaptive Regression Splines): extends decision trees to handle
numerical data better
Conditional Inference Trees → Statistics-based approach that uses non-parametric
tests as splitting criteria, corrected for multiple testing to avoid overfitting.
The ID3 algorithm builds decision trees using a top-down greedy search approach
through the space of possible branches with no backtracking. A greedy
algorithm, as the name suggests, always makes the choice that seems to be
the best at that moment.
In a decision tree, to predict the class of a given record, the algorithm
starts from the root node of the tree. It compares the value of the
root attribute with the corresponding attribute of the record (real dataset) and, based on
the comparison, follows the branch and jumps to the next node.
For the next node, the algorithm again compares the attribute value with the
other sub-nodes and moves further. It continues this process until it reaches a
leaf node of the tree. The complete process can be better understood using
the algorithm below:
Step-1: Begin the tree with the root node, say S, which contains the
complete dataset.
Step-2: Find the best attribute in the dataset using an Attribute Selection
Measure (ASM).
Step-3: Divide S into subsets that contain possible values for
the best attribute.
Step-4: Generate the decision tree node, which contains the best attribute.
Step-5: Recursively make new decision trees using the subsets of the
dataset created in Step-3.
Step-6: Continue this process until a stage is reached where you
cannot further classify the nodes; call the final node a leaf
node.
Entropy:
Entropy is a measure of the randomness in the information
being processed. The higher the entropy, the harder it is to
draw any conclusions from that information. Flipping a coin
is an example of an action that provides information that is
random.
For a binary outcome, the entropy H(X) is zero when the probability is either
0 or 1. The entropy is maximum when the probability is 0.5, because that
represents perfect randomness in the data and there is no chance of
perfectly determining the outcome.
Information Gain
Information gain or IG is a statistical property that
measures how well a given attribute separates the
training examples according to their target
classification. Constructing a decision tree is all
about finding an attribute that returns the highest
information gain and the smallest entropy.
ID3 follows the rule: a branch with an entropy of zero is a leaf node, and a branch with
entropy greater than zero needs further splitting.
The information gain of an attribute is:
$Gain(Attribute) = I(P, N) - Entropy(Attribute)$
where, for a partition containing $p_i$ positive and $n_i$ negative examples,
$I(p_i, n_i) = -\frac{p_i}{p_i+n_i}\log_2\frac{p_i}{p_i+n_i} - \frac{n_i}{p_i+n_i}\log_2\frac{n_i}{p_i+n_i}$
Entropy of an Attribute is:
$Entropy(Attribute) = \sum_i \frac{p_i+n_i}{P+N}\, I(p_i, n_i)$
where $P$ and $N$ are the total numbers of positive and negative examples in the data set.
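The formulas above can be computed directly. Below is a minimal Python sketch for a Boolean target with p positive and n negative examples; the numbers in the usage line are the classic ID3 weather-data illustration (P = 9, N = 5, attribute Outlook).

import math

def info(p, n):
    """I(p, n): expected information needed to classify a (p, n) split."""
    total = p + n
    result = 0.0
    for count in (p, n):
        if count:
            frac = count / total
            result -= frac * math.log2(frac)
    return result

def attribute_entropy(partitions, P, N):
    """Entropy(A) = sum over values of A of ((p_i + n_i)/(P + N)) * I(p_i, n_i)."""
    return sum((p_i + n_i) / (P + N) * info(p_i, n_i) for p_i, n_i in partitions)

def information_gain(partitions, P, N):
    """Gain(A) = I(P, N) - Entropy(A)."""
    return info(P, N) - attribute_entropy(partitions, P, N)

# Outlook splits the 14 weather examples into (2,3), (4,0), (3,2):
print(round(information_gain([(2, 3), (4, 0), (3, 2)], 9, 5), 3))  # ~0.247 bits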
Data set:
Basic algorithm for inducing a decision tree from training tuples:
Algorithm:
Generate_decision_tree: generate a decision tree from the training tuples of data
partition D.
Input:
Data partition D, which is a set of training tuples and their
associated class labels;
attribute_list, the set of candidate attributes;
Attribute_selection_method, a procedure to determine the splitting criterion
that “best” partitions the data tuples into individual classes. This criterion
consists of a splitting attribute and, possibly, either a split point or a
splitting subset.
Output: A decision tree.
Method:
(1) create a node N;
(2) if tuples in D are all of the same class C, then return N as a leaf node
labeled with the class C;
(3) if attribute_list is empty, then return N as a leaf node labeled with the
majority class in D; // majority voting
(4) apply Attribute_selection_method(D, attribute_list) to find the “best”
splitting criterion;
(5) label node N with the splitting criterion;
(6) if the splitting attribute is discrete-valued and multiway splits are
allowed, then // not restricted to binary trees
(7) attribute_list = attribute_list - splitting attribute;
(8) for each outcome j of the splitting criterion // partition the tuples and
grow subtrees for each partition
(9) let Dj be the set of data tuples in D satisfying outcome j; // a partition
(10) if Dj is empty, then attach a leaf labeled with the majority class in D
to node N; else attach the node returned by
Generate_decision_tree(Dj, attribute_list) to node N;
(11) return N;
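A compact Python sketch of this recursive procedure is given below. It assumes the data arrives as a list of dicts with discrete attributes and a "class" key, and the select_attribute argument stands in for the attribute selection method (for example, the information gain computed earlier); it illustrates the control flow rather than reproducing the textbook algorithm exactly.

from collections import Counter

def majority_class(rows):
    return Counter(r["class"] for r in rows).most_common(1)[0][0]

def build_tree(rows, attributes, select_attribute):
    # (2) all tuples in the same class -> leaf labeled with that class
    classes = {r["class"] for r in rows}
    if len(classes) == 1:
        return {"leaf": classes.pop()}
    # (3) attribute list empty -> leaf labeled with the majority class
    if not attributes:
        return {"leaf": majority_class(rows)}
    # (4)-(5) choose the "best" splitting attribute
    best = select_attribute(rows, attributes)
    node = {"attribute": best, "children": {}}
    remaining = [a for a in attributes if a != best]
    # (8)-(10) partition on each outcome and grow a subtree per partition
    for value in {r[best] for r in rows}:
        subset = [r for r in rows if r[best] == value]
        node["children"][value] = (
            {"leaf": majority_class(rows)} if not subset
            else build_tree(subset, remaining, select_attribute)
        )
    return node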
Classification Trees:
A classification tree is an algorithm where the target variable is fixed or
categorical. The algorithm is then used to identify the “class” within which a
target variable would most likely fall. An example of a classification-type
problem would be determining who will or will not subscribe to a digital
platform, or who will or will not graduate from high school.
These are examples of simple binary classifications where the categorical
dependent variable can assume only one of two, mutually exclusive
values.
Regression Trees
A regression tree refers to an algorithm where the target variable is
continuous, and the algorithm is used to predict its value.
As an example of a regression-type problem, you may want to predict the
selling price of a residential house, which is a continuous dependent
variable.
This will depend on both continuous factors like square footage as well as
categorical factors.
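A minimal regression-tree sketch with scikit-learn; the square-footage/price data is synthetic and only meant to show a continuous target being predicted.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
sqft = rng.uniform(500, 4000, 400)
price = 50_000 + 120 * sqft + rng.normal(0, 20_000, 400)  # hypothetical relationship

reg = DecisionTreeRegressor(max_depth=4, random_state=0)
reg.fit(sqft.reshape(-1, 1), price)
print(reg.predict([[1500.0]]))  # predicted selling price for a 1500 sq ft house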
One of the questions that arises in a decision tree algorithm is the optimal size of
the final tree. A tree that is too large risks overfitting the training data and
generalizing poorly to new samples. A small tree might not capture important
structural information about the sample space. However, it is hard to tell when
a tree algorithm should stop, because it is impossible to tell whether the
addition of a single extra node will dramatically decrease error.
This problem is known as the horizon effect. A common strategy is to grow the
tree until each node contains a small number of instances, then use pruning to
remove nodes that do not provide additional information. Pruning should
reduce the size of a learning tree without reducing predictive accuracy as
measured by a cross-validation set. There are many techniques for tree pruning
that differ in the measurement that is used to optimize performance.
Pruning Techniques:
Pruning processes can be divided into two types: pre-pruning and post-pruning.
Pre-pruning procedures prevent a complete induction of the training set
by replacing a stop() criterion in the induction algorithm (e.g. maximum tree
depth or information gain(Attr) > minGain). They are considered to be more
efficient because they do not induce an entire tree; rather, trees
remain small from the start.
Post-pruning (or just pruning) is the most common way of simplifying
trees. Here, nodes and subtrees are replaced with leaves to reduce
complexity.
The procedures are differentiated on the basis of their approach in the tree:
top-down approach & bottom-up approach.
An overfitted model performs well on the training data but fails to generalize
to the test dataset. This is because the model has trained itself in a very complex
manner and has high variance.
The best-fit model is one for which both training and testing (validation)
loss are minimum, or we can say training and testing accuracy should be
near each other and high in value.
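One way to see the overfitting/pruning trade-off in practice is scikit-learn's cost-complexity (post-)pruning; the dataset and the ccp_alpha values in this sketch are arbitrary choices for illustration.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Larger ccp_alpha prunes more aggressively: training accuracy drops a little,
# but the gap between training and test accuracy shrinks.
for alpha in (0.0, 0.01, 0.03):
    tree = DecisionTreeClassifier(ccp_alpha=alpha, random_state=0).fit(X_tr, y_tr)
    print(f"alpha={alpha}: train={tree.score(X_tr, y_tr):.3f}, "
          f"test={tree.score(X_te, y_te):.3f}, leaves={tree.get_n_leaves()}")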
Time Series Methods:
Time series forecasting focuses on analyzing data changes across
equally spaced time intervals.
Time series analysis is used in a wide variety of domains, ranging
from econometrics to geology and earthquake prediction; it’s also used
in almost all applied sciences and engineering.
Time-series databases are highly popular and support a wide spectrum of
applications such as stock market analysis, economic and
sales forecasting, and budget analysis, to name a few.
They are also useful for studying natural phenomena like atmospheric
pressure, temperature, wind speeds, earthquakes, and medical
prediction for treatment.
Time series data is data that is observed at different points in time. Time
Series Analysis finds hidden patterns and helps obtain useful insights
from the time series data.
Time Series Analysis is useful in predicting future values or detecting
anomalies from the data. Such analysis typically requires many data
points to be present in the dataset to ensure consistency and
reliability.
The different types of models and analyses that can be created through
time series analysis are:
o Classification: identify and assign categories to the data.
o Curve fitting: plot the data along a curve and study the
relationships of variables present within the data.
o Descriptive analysis: identify patterns in time-series
data, such as trends, cycles, or seasonal variation.
o Explanative analysis: understand the data and its relationships,
the dependent features, and cause and effect and their trade-offs.
o Exploratory analysis: describe and focus on the main characteristics
of the time series data, usually in a visual format.
o Forecasting: predict future data based on historical trends, using
the historical data as a model for future data and predicting
scenarios that could happen along with the future plot points.
o Intervention analysis: study how an event can change the data.
o Segmentation: split the data into segments to discover the
underlying properties of the source information.
Components of Time Series:
Long term trend – The smooth long term direction of time series where the data
can increase or decrease in some pattern.
Seasonal variation – Patterns of change in a time series within a year which tends
to repeat every year.
Cyclical variation – It is much like seasonal variation, but the rise and fall of the
time series occur over periods longer than one year.
Irregular variation – Any variation that is not explainable by any of the three
components mentioned above. It can be classified into stationary and non-
stationary variation.
Stationary variation: when the data neither increases nor decreases, i.e. it is
completely random, it is called stationary variation. When the data has some
explainable portion remaining and can be analyzed further, such a case is
called non-stationary variation.
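These components can be separated with a classical decomposition. The sketch below assumes statsmodels and pandas and uses a synthetic monthly series with an upward trend, yearly seasonality, and random noise.

import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

idx = pd.date_range("2015-01-01", periods=96, freq="MS")  # monthly observations
values = (np.linspace(100, 200, 96)                       # long-term trend
          + 10 * np.sin(2 * np.pi * np.arange(96) / 12)   # seasonal variation
          + np.random.default_rng(0).normal(0, 3, 96))    # irregular variation
series = pd.Series(values, index=idx)

result = seasonal_decompose(series, model="additive", period=12)
print(result.trend.dropna().head())
print(result.seasonal.head(12))  # the repeating within-year pattern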
The acronym ARIMA stands for Auto-Regressive Integrated Moving Average. Lags of
the stationarized series in the forecasting equation are called
"autoregressive" terms, lags of the forecast errors are called "moving
average" terms, and a time series which needs to be differenced to be made
stationary is said to be an "integrated" version of a stationary series.
Random-walk and random-trend models, autoregressive models, and exponential
smoothing models are all special cases of ARIMA models.
A nonseasonal ARIMA model is classified as an "ARIMA(p,d,q)" model, where:
p is the number of autoregressive terms,
d is the number of nonseasonal differences needed for stationarity, and
q is the number of lagged forecast errors in the prediction equation.
The forecasting equation is constructed as follows. First, let y denote the dth
difference of Y, which means:
If d=0: y_t = Y_t
If d=1: y_t = Y_t - Y_{t-1}
If d=2: y_t = (Y_t - Y_{t-1}) - (Y_{t-1} - Y_{t-2}) = Y_t - 2Y_{t-1} + Y_{t-2}
Note that the second difference of Y (the d=2 case) is not the difference
from 2 periods ago. Rather, it is the first-difference-of-the-first-difference,
which is the discrete analog of a second derivative, i.e., the local acceleration
of the series rather than its local trend.
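A short sketch of differencing and an ARIMA fit, assuming statsmodels; the ARIMA(1,1,1) order is an illustrative assumption, not a recommendation for any particular series.

import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(42)
# A random walk with drift: non-stationary until differenced once (d = 1).
y = pd.Series(np.cumsum(rng.normal(0.5, 1.0, 200)))

first_diff = y.diff().dropna()          # y_t = Y_t - Y_(t-1)
second_diff = y.diff().diff().dropna()  # first difference of the first difference

model = ARIMA(y, order=(1, 1, 1)).fit()
print(model.params)             # estimated AR, MA, and variance terms
print(model.forecast(steps=5))  # five-step-ahead forecast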
Forecast errors are measured over n time periods for which we have both actual
demand and forecast values; the error in period i is e_i = (actual_i - forecast_i).
1. Mean Forecast Error (MFE):
$MFE = \frac{\sum_{i=1}^{n} e_i}{n}$
Ideal value = 0; if MFE > 0, the model tends to under-forecast; if MFE < 0, the
model tends to over-forecast.
2. Mean Absolute Deviation (MAD): for n time periods where we have actual
demand and forecast values:
$MAD = \frac{\sum_{i=1}^{n} |e_i|}{n}$
While MFE is a measure of forecast model bias, MAD indicates the absolute
size of the errors
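Both measures are simple to compute; a minimal sketch with made-up actual and forecast values:

def mean_forecast_error(actual, forecast):
    errors = [a - f for a, f in zip(actual, forecast)]
    return sum(errors) / len(errors)

def mean_absolute_deviation(actual, forecast):
    errors = [abs(a - f) for a, f in zip(actual, forecast)]
    return sum(errors) / len(errors)

actual   = [100, 110, 120, 130]
forecast = [98, 112, 118, 135]
print(mean_forecast_error(actual, forecast))      # -0.75 -> model tends to over-forecast
print(mean_absolute_deviation(actual, forecast))  # 2.75 -> average absolute error size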
Uses of Forecast error:
Forecast model bias
Absolute size of the forecast errors
Compare alternative forecasting models
Identify forecast models that need adjustment
ETL Approach:
Extract, Transform and Load (ETL) refers to a process in database usage and
especially in data warehousing that:
Extracts data from homogeneous or heterogeneous data sources
Transforms the data for storing it in proper format or structure for
querying and analysis purpose
Loads it into the final target (database, more specifically, operational
data store, data mart, or data warehouse)
Usually, all three phases execute in parallel. Since data extraction takes
time, while the data is being pulled another transformation process
executes, processing the already received data and preparing it for
loading; as soon as some data is ready to be loaded into the target,
the data loading kicks off without waiting for the completion of the previous
phases.
ETL systems commonly integrate data from multiple applications (systems),
typically developed and supported by different vendors or hosted on separate
computer hardware. The disparate systems containing the original data are
frequently managed and operated by different employees. For example, a
cost accounting system may combine data from payroll, sales, and
purchasing.
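A minimal, self-contained ETL sketch in Python using pandas and SQLite; the file name, column names, and target table are hypothetical.

import sqlite3
import pandas as pd

def extract(path):
    """Extract: pull raw records from a source file (here, a CSV)."""
    return pd.read_csv(path)

def transform(df):
    """Transform: unify formats and derive values before loading."""
    df = df.copy()
    df["email"] = df["email"].str.strip().str.lower()
    df["order_total"] = df["quantity"] * df["unit_price"]
    return df

def load(df, db_path, table):
    """Load: write the prepared data into the target database."""
    with sqlite3.connect(db_path) as conn:
        df.to_sql(table, conn, if_exists="append", index=False)

# Usage: load(transform(extract("orders.csv")), "warehouse.db", "fact_orders")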
Commercially available ETL tools include:
Anatella
Alteryx
CampaignRunner
ESF Database Migration Toolkit
Informatica PowerCenter
Talend
IBM InfoSphere DataStage
Ab Initio
Oracle Data Integrator (ODI)
Oracle Warehouse Builder (OWB)
Microsoft SQL Server Integration Services (SSIS)
Tomahawk Business Integrator by Novasoft Technologies.
Pentaho Data Integration (or Kettle) – open-source data integration
framework
Stambia
Diyotta DI-SUITE for Modern Data Integration
FlyData
Rhino ETL
SAP Business Objects Data Services
SAS Data Integration Studio
SnapLogic
CloverETL – open-source engine supporting only basic partial
functionality, not the server
SQ-ALL - ETL with SQL queries from internet sources such as APIs
North Concepts Data Pipeline
There are various steps involved in ETL. They are as below in detail:
Extract:
The Extract step covers the data extraction from the source system and makes it
accessible for further processing. The main objective of the extract step is to
retrieve all the required data from the source system with as few resources
as possible. The extract step should be designed in a way that it does not
negatively affect the source system in terms of performance, response time
or any kind of locking.
There are several ways to perform the extract:
Update notification - if the source system is able to provide a notification
that a record has been changed and describe the change, this is the
easiest way to get the data.
Incremental extract - some systems may not be able to provide
notification that an update has occurred, but they are able to identify
which records have been modified and provide an extract of such
records. During further ETL steps, the system needs to identify the changes
and propagate them down. Note that by using a daily extract, we may not be
able to handle deleted records properly.
Full extract - some systems are not able to identify which data has
been changed at all, so a full extract is the only way one can get the
data out of the system. The full extract requires keeping a copy of the
last extract in the same format in order to be able to identify changes.
Full extract handles deletions as well.
When using incremental or full extracts, the extract frequency is
extremely important, particularly for full extracts, where the data volumes
can be in the tens of gigabytes.
Clean: The cleaning step is one of the most important as it ensures
the quality of the data in the data warehouse. Cleaning should perform
basic data unification rules, such as:
Making identifiers unique (sex categories Male/Female/Unknown,
M/F/null, Man/Woman/Not Available are translated to standard
Male/Female/Unknown)
Convert null values into a standardized Not Available/Not Provided value
Convert phone numbers and ZIP codes to a standardized form
Validate address fields, convert them into proper naming, e.g.
Street/St/St./Str./Str
Validate address fields against each other (State/Country, City/State,
City/ZIP code, City/Street).
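A sketch of these unification rules with pandas; the input values and mappings are hypothetical.

import pandas as pd

df = pd.DataFrame({
    "sex":   ["M", "Woman", None, "Male"],
    "phone": ["(040) 123-4567", "040 1234567", None, "0401234567"],
    "zip":   ["50001 ", "500 01", None, "50001"],
})

# Translate the different sex encodings to a standard Male/Female/Unknown.
sex_map = {"M": "Male", "Male": "Male", "Man": "Male",
           "F": "Female", "Female": "Female", "Woman": "Female"}
df["sex"] = df["sex"].map(sex_map).fillna("Unknown")

# Convert phone numbers and ZIP codes to one format, then standardize nulls.
df["phone"] = df["phone"].str.replace(r"[^0-9]", "", regex=True).fillna("Not Available")
df["zip"] = df["zip"].str.replace(r"\s", "", regex=True).fillna("Not Available")
print(df)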
Transform:
The transform step applies a set of rules to transform the data from the
source to the target.
This includes converting any measured data to the same dimension (i.e.
conformed dimension) using the same units so that they can later be
joined.
The transformation step also requires joining data from several sources,
generating aggregates, generating surrogate keys, sorting, deriving new
calculated values, and applying advanced validation rules.
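A sketch of typical transform-step operations with pandas (joining sources, generating aggregates, assigning surrogate keys); the table and column names are hypothetical.

import pandas as pd

orders = pd.DataFrame({"customer_id": [1, 1, 2], "amount": [100.0, 50.0, 75.0]})
customers = pd.DataFrame({"customer_id": [1, 2], "region": ["North", "South"]})

# Join data from several sources on the shared business key.
joined = orders.merge(customers, on="customer_id", how="left")

# Generate aggregates (total spend per region) as a derived measure.
by_region = joined.groupby("region", as_index=False)["amount"].sum()

# Assign surrogate keys independent of any source-system identifier.
by_region["region_key"] = range(1, len(by_region) + 1)
print(by_region)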
Load:
During the load step, it is necessary to ensure that the load is performed
correctly and with as few resources as possible. The target of the load
process is often a database.
In order to make the load process efficient, it is helpful to disable any
constraints and indexes before the load and enable them back only
after the load completes. Referential integrity needs to be
maintained by the ETL tool to ensure consistency.
In order to derive the hypothesis space, we compute the entropy and information gain of
the class and attributes, using the following formulae:
$Gain(Attribute) = I(P, N) - Entropy(Attribute)$
where, for a partition containing $p_i$ positive and $n_i$ negative examples,
$I(p_i, n_i) = -\frac{p_i}{p_i+n_i}\log_2\frac{p_i}{p_i+n_i} - \frac{n_i}{p_i+n_i}\log_2\frac{n_i}{p_i+n_i}$
Entropy of an Attribute is:
$Entropy(Attribute) = \sum_i \frac{p_i+n_i}{P+N}\, I(p_i, n_i)$
Data set: