Statistical Data Analysis
(DA204C)
Descriptive analytics
► Descriptive analysis is data simplification, where past data is collected, organised and then
presented in a way that is easily understood.
► A good descriptive analysis is meant to answer relevant research questions. However, unlike other
methods of analysis, it is not used to draw inferences or predictions from its findings.
► As the simplest form of data analytics, it can stand on its own as a research product.
► Descriptive analysis uses data aggregation and data mining as two key methods for analytics.
Table: Descriptive data analysis – https://colab.research.google.com/drive/1ImOzFu11jATJOs1Jt0Vro0qWOYGbb__l?usp=sharing
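As a rough illustration of the data aggregation step described above, the sketch below groups a handful of past records and summarises each group. The `sales` records and the `summarise_by_group` helper are hypothetical, invented for this example; they are not taken from the linked notebook.

```python
from collections import defaultdict
from statistics import mean, median

# Hypothetical past records: (region, sale amount)
sales = [
    ("north", 120.0),
    ("south", 80.0),
    ("north", 100.0),
    ("south", 95.0),
    ("north", 140.0),
]

def summarise_by_group(records):
    """Aggregate values per group into count, mean, median, min and max."""
    groups = defaultdict(list)
    for key, value in records:
        groups[key].append(value)
    return {
        key: {
            "count": len(values),
            "mean": mean(values),
            "median": median(values),
            "min": min(values),
            "max": max(values),
        }
        for key, values in groups.items()
    }

summary = summarise_by_group(sales)
for region, stats in sorted(summary.items()):
    print(region, stats)
```

The summary only describes what has already happened; nothing in it is inferred or predicted.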
Predictive analytics
– The term refers to the use of statistical measures and modeling techniques to make
predictions about future events.
– It serves as a causal analysis tool for improving performance and minimizing risk.
– Predictive analytic models help make predictions in a variety of domains, such as
weather forecasting or marketing trend analysis.
– Predictive models use regression methods, decision trees, and neural networks for
analysis.
– Following predictive analysis, prescriptive analysis uses the knowledge gained in the
previous two steps to determine the future course of action.
– Prescriptive analytics anticipates what, when and, importantly, why something
might happen.
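To make the regression methods mentioned for predictive models more concrete, here is a minimal sketch of a one-feature least-squares fit used to forecast the next value. The `weeks` and `units` data and the `fit_line` helper are assumptions made up for this example.

```python
def fit_line(xs, ys):
    """Ordinary least squares for a single feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical history: week number vs. units sold
weeks = [1, 2, 3, 4, 5]
units = [3.0, 5.0, 7.0, 9.0, 11.0]

slope, intercept = fit_line(weeks, units)

# Predict a future (unseen) week from the fitted line
forecast_week_6 = slope * 6 + intercept
print(slope, intercept, forecast_week_6)
```

Unlike a descriptive summary, the fitted line is used to say something about a point that is not in the data.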
Predictive analytics tools – Decision trees
Decision trees are used to visually and explicitly represent
decisions and decision making.
• It is a common analytic tool for classification and
regression tasks.
• The trees are drawn upside down, with the root node
at the top.
• Decision tree algorithms are referred to as CART
(classification and regression trees).
– Decisions are generally learnt by recursive binary splitting.
– Each feature of the dataset is a candidate for splitting, and a splitting cost is computed for it.
– The candidate with the lowest splitting cost is chosen, following a greedy
algorithm.
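The greedy, recursive binary splitting described above can be sketched as follows. This is a simplified illustration, not a full CART implementation: the cost used here (the number of samples outside the majority class of each group) is a stand-in chosen for brevity, and the names `best_split`, `build_tree`, `predict` and the toy data are all hypothetical.

```python
from collections import Counter

def misclassified(labels):
    """Illustrative cost: samples not belonging to the group's majority class."""
    if not labels:
        return 0
    return len(labels) - Counter(labels).most_common(1)[0][1]

def best_split(rows, labels):
    """Greedy step: try every feature and threshold, keep the cheapest split."""
    best = None  # (cost, feature_index, threshold)
    for f in range(len(rows[0])):
        for threshold in sorted({row[f] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[f] < threshold]
            right = [y for row, y in zip(rows, labels) if row[f] >= threshold]
            if not left or not right:
                continue  # one side is empty, so nothing is actually split
            cost = misclassified(left) + misclassified(right)
            if best is None or cost < best[0]:
                best = (cost, f, threshold)
    return best

def build_tree(rows, labels, depth=0, max_depth=3):
    """Recursive binary splitting; a leaf stores the majority class."""
    majority = Counter(labels).most_common(1)[0][0]
    if depth >= max_depth or len(set(labels)) == 1:
        return majority
    split = best_split(rows, labels)
    if split is None:
        return majority
    _, f, threshold = split
    left = [i for i, row in enumerate(rows) if row[f] < threshold]
    right = [i for i, row in enumerate(rows) if row[f] >= threshold]
    return {
        "feature": f,
        "threshold": threshold,
        "left": build_tree([rows[i] for i in left], [labels[i] for i in left],
                           depth + 1, max_depth),
        "right": build_tree([rows[i] for i in right], [labels[i] for i in right],
                            depth + 1, max_depth),
    }

def predict(tree, row):
    """Walk from the root down to a leaf."""
    while isinstance(tree, dict):
        branch = "left" if row[tree["feature"]] < tree["threshold"] else "right"
        tree = tree[branch]
    return tree

# Toy data: two features per row, two classes
rows = [[2.0, 1.0], [3.0, 1.5], [6.0, 0.5], [7.0, 2.0]]
labels = ["a", "a", "b", "b"]
tree = build_tree(rows, labels)
print(tree)
```

Each call commits to the locally cheapest split and never revisits it, which is what makes the procedure greedy.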
Predictive analytics tools – Decision trees (Contd.)
– Minimizing the cost function in decision trees amounts to finding the most homogeneous branches.
– For regression tasks, the cost function is usually the mean squared difference between the
predicted values ($\hat{y}$) and the actual values ($y$):
$$L = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2$$
This cost is calculated for all candidate splits, and the candidate with minimum cost is
chosen.
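The evaluation of candidate splits for regression can be sketched as below. The sketch assumes each branch predicts the mean of its own samples, so the size-weighted cost of the two branches equals the mean squared error over all N points. The helper names and the toy data are hypothetical.

```python
def mse(values):
    """Mean squared error of a group when it predicts its own mean."""
    if not values:
        return 0.0
    prediction = sum(values) / len(values)
    return sum((y - prediction) ** 2 for y in values) / len(values)

def split_cost(x, y, threshold):
    """Size-weighted MSE of the two groups produced by x < threshold."""
    left = [yi for xi, yi in zip(x, y) if xi < threshold]
    right = [yi for xi, yi in zip(x, y) if xi >= threshold]
    return (len(left) * mse(left) + len(right) * mse(right)) / len(y)

# Toy data: one feature, with a jump in the target between x = 2 and x = 3
x = [1, 2, 3, 4]
y = [1.0, 1.2, 3.0, 3.2]

# Evaluate every candidate threshold that leaves both sides non-empty
candidates = sorted(set(x))[1:]
costs = {t: split_cost(x, y, t) for t in candidates}
best_threshold = min(costs, key=costs.get)
print(costs, best_threshold)
```

On this data the lowest cost is reached at the threshold that separates the two clusters of target values.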
– For classification, entropy or cross-entropy is used as the cost function.
– The Gini index is another suitable measure. It is given by
$$G = \frac{1}{K} \sum_{k=1}^{K} p_k (1 - p_k)$$
where $p_k$ is the proportion of samples of class $k$ within the group being evaluated, and $K$ is the number of classes.
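A small sketch of the two classification costs mentioned above, computed for a single group of labels. Two assumptions are worth stating: the entropy formula is the standard Shannon entropy in bits, which is not spelled out in these notes, and the optional `num_classes` argument applies the 1/K scaling from the formula above, whereas many references define the Gini index without that factor. With a fixed K the scaling does not change which split has the lowest cost.

```python
from collections import Counter
from math import log2

def proportions(labels):
    """Class proportions p_k within one group of labels."""
    n = len(labels)
    return [count / n for count in Counter(labels).values()]

def gini(labels, num_classes=None):
    """Sum of p_k * (1 - p_k) over the classes present in the group.

    If num_classes (K) is given, the sum is divided by K to match the
    1/K-scaled form; otherwise the unscaled sum is returned.
    """
    total = sum(p * (1 - p) for p in proportions(labels))
    return total / num_classes if num_classes else total

def entropy(labels):
    """Shannon entropy in bits: sum of -p_k * log2(p_k)."""
    return sum(-p * log2(p) for p in proportions(labels))

# Hypothetical groups: one perfectly homogeneous, one evenly mixed
pure = ["yes", "yes", "yes", "yes"]
mixed = ["yes", "yes", "no", "no"]

for name, group in (("pure", pure), ("mixed", mixed)):
    print(name, gini(group), gini(group, num_classes=2), entropy(group))
```

Both measures are zero for a homogeneous group and largest for an evenly mixed one, which is why minimizing them favours homogeneous branches.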