Session 1
stats 1, stats 2, econometrics: everything comes together here
1. **Machine Learning**: This branch focuses on developing algorithms that enable computers to learn and make predictions or
decisions without being explicitly programmed. It includes various types of learning paradigms such as supervised,
unsupervised, and reinforcement learning.
2. **Statistics**: Statistics forms the foundation of data science. It involves the collection, analysis, interpretation, and
presentation of data. Statistical methods are used to make inferences and draw conclusions from data.
3. **Data Mining**: Data mining involves the process of discovering patterns and relationships in large datasets. It often
involves techniques like clustering, association rule mining, and anomaly detection.
4. **Big Data**: This branch deals with the processing and analysis of extremely large and complex datasets that traditional
data processing techniques may struggle to handle. It includes technologies like Hadoop and Spark.
(their continuation)
data analysis? data mining? machine learning?
multivariate analysis
umbrella area = data science, it contains everything
if I calculate x̄ (the sample mean), is that a machine learning operation? = using past info to predict something unknown?
=> principle of machine learning
book : statistical learning
Data science encompasses a wide range of techniques and methods for extracting knowledge and insights from data. Some of
the main branches of data science include:
5. **Data Visualization**: Data visualization is the practice of representing data in graphical or visual formats to help users
understand the patterns, trends, and insights within the data.
6. **Natural Language Processing (NLP)**: NLP focuses on enabling computers to understand, interpret, and generate human
language. It's used in applications like sentiment analysis, machine translation, and chatbots.
7. **Time Series Analysis**: This branch is concerned with analyzing data that is ordered by time. It's commonly used in fields
like finance, economics, and environmental science to make forecasts and predictions.
8. **Deep Learning**: Deep learning is a subset of machine learning that utilizes deep neural networks with multiple layers to
learn complex features from data. It has been particularly successful in tasks like image recognition and natural language
processing.
9. **Bayesian Methods**: Bayesian statistics is a framework for statistical inference where probability distributions are assigned
to both the parameters and data. It's used for tasks like parameter estimation and uncertainty quantification.
10. **Reinforcement Learning**: This is a type of machine learning where an agent learns to make a sequence of decisions by
interacting with an environment. It's commonly used in areas like robotics and game playing.
11. **Factor Analysis and Dimensionality Reduction**: These techniques aim to reduce the complexity of data by representing it
in a lower-dimensional space while retaining as much of the relevant information as possible.
12. **Optimization**: Optimization methods are used to find the best solution to a problem from a set of possible solutions. It's
widely used in training machine learning models and solving various real-world problems.
13. **Experimental Design**: This branch focuses on designing experiments to collect data in a way that allows for valid and
reliable statistical inference.
These branches often overlap and complement each other. Data scientists draw from various disciplines to select the most
appropriate techniques and methods for a given problem or dataset. The choice of which branch to use depends on the specific
nature of the data and the objectives of the analysis.
supervised : using past available info to make rules and predict the future
machine learning
Ibn Khaldun: study history to know the future
principle of econometrics:
supervised learning: we train an algorithm on available data to predict the future
unsupervised: explore and extract knowledge and useful information from the data
e.g. techniques used to visualize data: an unsupervised learning method, for example
e.g. calculating the variance of a given variable to get an idea about the future range of the variable
factor analysis: unsupervised learning
constructing a confidence interval
computing correlation between variables is likewise unsupervised
because we're exploring the data and trying to get useful info
semi-supervised????
Why are these techniques important and why do we use them and eventually make money?
*churn analysis !!!!
Churn analysis, also known as customer attrition analysis, is the process of examining and understanding the rate at which
customers stop using a product, service, or subscription over a specific period of time. The customers who discontinue their use
are referred to as "churned" or "lost" customers.
Here's a step-by-step explanation of churn analysis:
1. **Define the Churn Event**: The first step is to define what constitutes a "churn event" for your specific business. This could
be when a customer cancels a subscription, stops using a service, or makes their last purchase. It's important to have a clear
definition to accurately track and analyze churn.
2. **Data Collection**: Gather relevant data about customer behavior, interactions, and transactions. This data might include
customer sign-up dates, usage patterns, purchase history, customer demographics, customer feedback, and any other relevant
information.
3. **Calculate Churn Rate**: The churn rate is the percentage of customers who have churned within a specific time period. It's
typically calculated using the formula:
Churn Rate = (Customers lost during the period / Total customers at the start of the period) × 100
This gives you a percentage that indicates how many customers you've lost relative to your total customer base.
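The churn rate formula can be sketched in a few lines of Python (the subscriber counts below are made-up illustrative numbers):

```python
def churn_rate(customers_at_start, customers_lost):
    """Churn rate = customers lost during the period / customers at the start, as a %."""
    return customers_lost / customers_at_start * 100

# Hypothetical example: 1,000 subscribers at the start of the month, 50 cancelled.
rate = churn_rate(customers_at_start=1000, customers_lost=50)
print(f"monthly churn rate: {rate:.1f}%")  # 5.0%
```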
4. **Segmentation**: Analyze churn based on different customer segments. For example, you might want to know if churn rates
differ by geographic location, subscription tier, or other demographic factors. This helps identify which segments are more prone
to churn and can inform targeted retention efforts.
5. **Time Period Analysis**: Churn rates may vary over time. Analyze churn on a monthly, quarterly, or yearly basis to identify
trends. Sudden spikes or drops in churn rates might indicate specific events or changes in your business that need further
investigation.
6. **Customer Feedback and Surveys**: Collect feedback from churned customers through surveys or interviews. This
qualitative data can provide valuable insights into the reasons behind their decision to leave. Understanding these reasons can
help in developing strategies to reduce churn.
7. **Root Cause Analysis**: Identify the underlying causes of churn. These could include factors like poor customer service,
product dissatisfaction, better competitive alternatives, pricing issues, or changes in customer circumstances.
8. **Predictive Churn Modeling**: Use machine learning techniques to build predictive models that estimate the likelihood of a
customer churning in the future. These models can help prioritize retention efforts for at-risk customers.
9. **Implement Retention Strategies**: Based on the insights gained from the analysis, develop and implement targeted
retention strategies. This might include improving customer service, enhancing product features, offering incentives, or
adjusting pricing.
10. **Monitor and Iterate**: Continuously monitor churn rates and the effectiveness of your retention strategies. Adjust your
approach as needed to address changing customer behavior and market conditions.
Churn analysis is a critical component of customer relationship management and can have a significant impact on a business's
profitability and long-term success. By understanding why customers leave, businesses can take proactive steps to retain
existing customers and ultimately improve their bottom line.
info from the analysis + customer behaviour => you can make an accurate model
you have to feed and train the algorithm
Orange does churn analysis (and so do banks)
predicting or forecasting automatically ⇒ SUPERVISED
*association rule mining technique (based on conditional probability)
it lets you do
market basket analysis
● text books
sql, coursera maher challouf?
python
r
factor analysis
k means clustering
agglomerative hierarchical clustering
why the right number of clusters, and how do we find it? (suitable number of clusters)
k-nearest neighbor KNN
markdown *python
*kaggle
[Link]
extreme gradient boosting algorithm
_________________________________
ch : sampling
how to get representative sample?
u can draw any sample
opinion polls (Fr: sondage)
why is sampling important?
because most of the time the population is large (and a census is expensive, money- and time-wise)
census (= the whole population)
(A census is the procedure of systematically acquiring, recording and calculating population
information about the members of a given population)
sampling (estimation of the population average ) vs census(working on each observation of the data)
in practice, a census is most of the time impossible, especially for social studies
e.g. average people's height
census (te3ded)
most of the time we work on samples and not populations (except when the population is small)
why, if we draw a large sample, can we use it to estimate the true average parameter?
law of large numbers
we have a frame: it has info on every population member (they must all have equal chances of being selected, 1/N?)
sample should inherit the population’s characteristics
(The three rules of the central limit theorem are as follows:
● The data should be sampled randomly.
● The samples should be independent of each other.
● The sample size should be sufficiently large but not exceed 10% of the
population.
)
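The idea that means of large random samples concentrate around the true population mean can be checked with a small stdlib-only simulation (the population parameters here are made up for illustration):

```python
import random
import statistics

random.seed(42)  # makes the random experiment repeatable

# Hypothetical population: 100,000 "heights" centred around 170 cm.
population = [random.gauss(170, 10) for _ in range(100_000)]
true_mean = statistics.mean(population)

# Draw many independent random samples (n = 50, well under 10% of N)
# and look at the distribution of their means.
sample_means = [
    statistics.mean(random.sample(population, 50)) for _ in range(2_000)
]

# The sample means cluster tightly around the true population mean.
print(round(true_mean, 1), round(statistics.mean(sample_means), 1))
```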
in the discipline of statistics and probability
is it fair or unfair?
I want to estimate the probability of success vs failure
*THE SIMULATION SLIDE: redo it in R
Session 2
if the sample isn’t representative, all info is useless, waste of time and money
what does representative mean?
if it inherits the population's features, properties, and characteristics
how: choose observations RANDOMLY, AT RANDOM from the population (the opposite would be BIASED sampling)
it must be drawn randomly
it must be big enough to inherit the population's characteristics
we use no rules (e.g. not everyone who shows up AT 8, because those might all be early risers with their own particular habits)
RANDOM RANDOM
2 goals: drawn at random and a large enough sample size
random: means you have the whole population in the same urn
however, in practice, we never have that 100% random case
in statistics, we say we have a frame
the first problem: getting the frame (if one exists, it's rich in information) (then you can divide the population into
homogeneous groups)
probability sampling techniques
1. simple random sampling: big urn containing all X (with or without replacement)
#ref: Bernoulli distribution
why expensive?
2 main approaches in sampling
part 1: probability sampling techniques (here we have the frame: it contains people's contacts, but can also hold additional info that
can be used later to stratify the population into many homogeneous groups, and that can be useful)
part 2: non-probability sampling techniques (most famous and practical: the quota technique) (used if you don't have a proper frame)
(see the slides for the different sampling techniques)
when the population is very large and the sample small, it doesn't matter whether we draw with or without replacement
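With/without replacement can be illustrated with Python's standard library: `random.sample` draws without replacement, `random.choices` with replacement (the frame below is a made-up list of member IDs):

```python
import random

random.seed(1)  # reproducible draws

# Hypothetical frame: every population member with an ID (equal chance 1/N each).
frame = list(range(1, 101))  # N = 100

without_replacement = random.sample(frame, 10)  # each member drawn at most once
with_replacement = random.choices(frame, k=10)  # a member may be drawn twice

# When N is much larger than n, both behave almost identically:
# repeats under replacement become very unlikely.
print(without_replacement)
print(with_replacement)
```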
in a case of
________________________________________________________________________________________________
Post midterms session
supervised: using historical data to forecast and make rules; these are for estimations, there is a target; that covers all of econometrics,
time series analysis, classification models
unsupervised: when there's no target, nothing you want to estimate through it, e.g. PCA, clustering, association rule mining
CLUSTERING
there are 2 main concepts
classification (French has two terms, "classification automatique / non supervisée" vs "classification supervisée"; English has just the one word)
(this one is supervised, with a categorical target, e.g.: will a student pass or not) vs clustering: the data is unlabeled, I don't have a categorical
target to predict, a black box? the main objective is to separate/divide the data into homogeneous groups
textbook: Machine Learning with R, Brett Lantz (2019 edition)
Hands-On Machine Learning with Python
the most famous clustering algorithm in machine learning: the K-means algorithm
there are 3 versions?
K-means is a method used to group or cluster similar data points into distinct groups. The "K" in
K-means refers to the number of clusters you want to create. Here's how it works:
Initialization: Start by randomly placing K points in your data space. These points represent
the initial guesses for the centers of your clusters.
Assignment: Assign each data point to the cluster whose center is closest to it. The
"closeness" is typically measured using the Euclidean distance.
Update: Recalculate the cluster centers by finding the mean of all the data points assigned to
each cluster. These means become the new cluster centers.
Repeat: Repeat steps 2 and 3 until the cluster assignments and the cluster centers no longer
change significantly.
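The four steps above can be sketched in plain Python; this is a toy Lloyd's-algorithm version on made-up 2-D points, not a production implementation:

```python
import math
import random

def kmeans(points, k, n_iter=100, seed=0):
    """Minimal K-means: init, assign, update, repeat until stable."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)               # 1. random initial centers
    clusters = [[] for _ in range(k)]
    for _ in range(n_iter):
        clusters = [[] for _ in range(k)]
        for p in points:                          # 2. assign to the closest center
            i = min(range(k), key=lambda j: math.dist(p, centers[j]))
            clusters[i].append(p)
        new_centers = [                            # 3. update: mean of each cluster
            tuple(sum(x) / len(c) for x in zip(*c)) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
        if new_centers == centers:                # 4. stop when nothing changes
            break
        centers = new_centers
    return centers, clusters

# Two obvious, made-up groups of 2-D points:
data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, clusters = kmeans(data, k=2)
print(sorted(len(c) for c in clusters))  # [3, 3]
```

In R the same idea is `kmeans(data, centers = 2, nstart = 10)`, where `nstart` reruns the random initialization and keeps the run with the lowest within-cluster sum of squares.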
with n_start, the algorithm already tries many sets of starting points and keeps the best ones (the lowest inertia)
k-means only works with numerical data
when you standardize the data you might be able to do something related to categorical features, but we won't dig into that because
it isn't designed for it
[Link]
python :
[Link]
spectral clustering algorithm
specc(data, k) works with non-convex clusters
_____
it starts (lol) from the best points, the ones with the lowest sum of squares?
it doesn't pick just any pair; it picks the best: it re-picks many random starts, tests them all, and keeps the one with the lowest
inertia (in Python or R the argument is called n_start and you tell it how many starting points)
it has an urn with 1000 balls
each one holds the coordinates of an individual
you tell it k = 3, for example; it takes 3 at random, records them, puts them back, and keeps repeating; in the space, it looks
at the points it chose at random and computes the objective function for every choice, for all n_start choices (the argument), say
you set it to 10: it TAKES THE CHOICE WHOSE CLUSTERS HAVE THE LOWEST SUM OF SQUARES
+++++++
OK, besides k-means are there more? k-medians?
____________________________________
Hierarchical Clustering
1. distance between sets of individuals
● complete link criterion
d(G1,G2) = d(I8,I5) (the two farthest individuals)
● average link criterion (WRONG NAME: IT'S CALLED THE CENTROID CRITERION!!!)
d(G1,G2) = d(C1,C2) (C1 and C2 are the cluster centers, supposed to sit in the middle of each group)
● single link criterion
d(G1,G2) = d(I10,I3) (in the end, the closest points to each other, the opposite of the first)
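The three linkage criteria can be written directly from their definitions; a small sketch with made-up 2-D groups:

```python
import math

def complete_link(g1, g2):
    """Distance between the two FARTHEST individuals of the two groups."""
    return max(math.dist(a, b) for a in g1 for b in g2)

def single_link(g1, g2):
    """Distance between the two CLOSEST individuals of the two groups."""
    return min(math.dist(a, b) for a in g1 for b in g2)

def centroid_link(g1, g2):
    """Distance between the group centers C1 and C2."""
    c1 = tuple(sum(x) / len(g1) for x in zip(*g1))
    c2 = tuple(sum(x) / len(g2) for x in zip(*g2))
    return math.dist(c1, c2)

# Hypothetical groups G1 and G2:
G1 = [(0, 0), (1, 0)]
G2 = [(4, 0), (6, 0)]
print(single_link(G1, G2))    # 3.0  (closest pair: (1,0)-(4,0))
print(complete_link(G1, G2))  # 6.0  (farthest pair: (0,0)-(6,0))
print(centroid_link(G1, G2))  # 4.5  (centers (0.5,0) and (5,0))
```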
by using the single link criterion
individuals 3 and 4 come out as the closest pair; do we count them as clusters, each one on its own?
we project them onto the X axis
that's the thing he did (check the pic on the phone, brain ref)
clustering dendrogram
cutree function
in statistics, Ward's method is a criterion applied in hierarchical cluster analysis. Ward's minimum variance method is a special
case of the objective function approach originally presented by Joe H. Ward, Jr
the pyramids method?
the pyramids method, Edwin Diday
_____
mubarki tips 101 xDDDD
___________________
Andrew (can't remember what), Coursera deep learning
____
text mining sentiment analysis?
__________________
back
how do I know which method is better?
entangled classes (Fr: classes enchevêtrées)
method “complete” we take the 2 farthest individuals
hclust complete
____
Euclidean distance
Manhattan distance
tuning
site :
[Link]/k-means-clustering-concepts-and-implementation-in-r-for-data-science-32cae6a3ceba
___
so good morning everyone, we've already covered clustering
we covered clustering
clustering is an unsupervised learning thing
he introduced clustering again (lol)
___
agglomerative hierarchical approach?
dendrogram
why is clustering data important and difficult? (lol) :(
unsupervised is harder than supervised, mainly because it's hard to evaluate the results in the unsupervised learning area
ch score?
paper?
help?
for the number of clusters we use either distances between clusters or within them
criteria
1. homogeneity (compactness): the traditional thing, WSS: inertia/WSS; to compare the compactness of partitions you use
WSS as a criterion
2. isolation: compute the distance between clusters (we have a global center and class centers): BSS (between-cluster
sum of squared distances) (isolation forest method?)
Cooper 1985? clustering (k-means)
elbow criterion? highest jump? beginner-level tricks, weak, not advanced
CH index: advanced, what's the formula?
stability approach? evaluation (lol)
it does data perturbation (lol): e.g. remove 20% of the data and redo the clustering, or do individual
replication, or add white noise :o so the original picture gets shaken up a bit, then compare it to the reference partition;
if the partition you find with, say, 2 clusters turns out more stable than the one with 4 (or the opposite), that's the "advanced level", my friend
what is the stability approach: if you have clusters from a random partition (i.e. clustering random data), then automatically, if the
data has no real structure, each perturbation gives a different k, sometimes 1, sometimes 2? when you perturb, the clusters merge into each other and
the reference partition gets lost during the process
adjusted Rand index? (Fr: l'indice de Rand ajusté)
(PCA? Bartlett's sphericity test, something like that)
PCA is only good when there's a sharp drop and there are big differences and gaps in the eigenvalues
you must always, always evaluate: do a model performance analysis to make sure the model is good (lol)
clustering without validation is worthless (lol), it's good for nothing :clown:
clustering validation has 2 parts
1. what is the correct number of clusters: CH index, stability approach (perturbation, or resampling with white noise)?
2. which method is correct (complete link? single link? k-means?) which one is best?
if a partition gives you interpretable, understandable results, this indicates the split makes sense and the differences are clear, so
that's a good indicator
clustering text? you take many pdf files (data / web scraping?)
you divide them into many topics (classification)
word embedding!!!! *packages (word2vec? BERT? GloVe?); R is lagging now on neural networks (R is fine for everything except
neural networks and deep learning)
keywords?
text mining method? I have a text
we take the frequent words but not the stop words (like "the")
the words become the features of a table
with the frequencies of the words
in Python this is called an n-gram
(DTM = document-term matrix)
coded text? (lol)
        w1  w2  w3  w4
text 1
text 2
text 3
text 4
NUMERICAL features, and the topics come out on their own :o
word cloud?? :o
nlp
you apply k-means to the table; those clusters are the topics
contextualization of the words ;)
if the target is numerical, use regression
(regression tree, random forest, neural networks)
econometrics suits proper ("classic") data sets; machine learning is better, right?
word embeddings are used for sentiment analysis (feelings); to compute similarity between words there's cosine similarity?
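The word-frequency table described above (a document-term matrix) can be sketched in plain Python; the stop-word list and the mini corpus are made-up examples:

```python
from collections import Counter

STOP_WORDS = {"the", "a", "of", "and", "is"}  # tiny illustrative stop list

def document_term_matrix(texts):
    """Build a DTM: rows = texts, columns = word frequencies."""
    tokenized = [
        [w for w in t.lower().split() if w not in STOP_WORDS] for t in texts
    ]
    vocab = sorted({w for doc in tokenized for w in doc})  # the feature columns
    rows = [[Counter(doc)[w] for w in vocab] for doc in tokenized]
    return vocab, rows

# Made-up mini corpus:
texts = ["the cat sat", "the cat and the dog", "stocks and markets"]
vocab, dtm = document_term_matrix(texts)
print(vocab)  # ['cat', 'dog', 'markets', 'sat', 'stocks']
print(dtm)    # [[1, 0, 0, 1, 0], [1, 1, 0, 0, 0], [0, 0, 1, 0, 1]]
```

k-means applied to these numerical rows would then group the texts by topic, as described above.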
___________________________________________________
MARKET BASKET ANALYSIS (unsupervised learning) / *association rule mining
it started in the 90s
he noticed that there are associations between products
weather forecasting
association rule mining is very important and yet not used much (lol) :( underrated, like me
Association rule
A—>B
wala (A,B)-->C
a pair implies something, or one thing implies another
P(B|A) = P(A ∩ B) / P(A) ⇒ confidence index
= count(A and B) / count(A) (you can also turn the count into a support)
(think of the words-and-texts data set, but with items instead of words and transactions instead of texts)
transaction method
the items are the products
____________________
purchased
cnn ? networks?
PYTHON GRAPH GALLERY
R GRAPH GALLERY
lift(X ⇒ Y) = supp(X ∩ Y) / (supp(X) × supp(Y))
correlation is a standardized covariance
______
if lift = 1, the probabilities of X and Y occurring are independent; no rule can be drawn
if lift > 1, it tells us the degree to which those occurrences are dependent on one another; useful for predicting the
consequent in future data sets
if lift < 1, it tells us the items are substitutes for each other, so the presence of one item has a negative effect on the other
the best data visualization for association rule mining is done with R
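The support/confidence/lift definitions above can be sketched in plain Python (the baskets are made-up transactions; items play the role of products):

```python
def support(transactions, items):
    """Fraction of transactions containing all the given items."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)

def confidence(transactions, a, b):
    """P(B|A) = support(A and B) / support(A)."""
    return support(transactions, a | b) / support(transactions, a)

def lift(transactions, a, b):
    """lift(A => B) = support(A and B) / (support(A) * support(B))."""
    return support(transactions, a | b) / (
        support(transactions, a) * support(transactions, b)
    )

# Hypothetical baskets (one set of items per transaction):
baskets = [
    {"bread", "milk"},
    {"bread", "milk", "eggs"},
    {"bread"},
    {"milk"},
]
print(confidence(baskets, {"bread"}, {"milk"}))  # 2/3 of bread buyers also buy milk
print(lift(baskets, {"bread"}, {"milk"}))        # ~0.89 < 1: slightly substitutes
```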
___________________________________
DBSCAN?
bandwidth?
it chooses any point from the data
for example, a min number of points of 3?
the core point is the center of a circle
every circle must contain at least 3 points
basically, picture me having many points
and there's a circle you move around
at each spot, the points only light up, meaning a cluster, if there are at least those 3, for example
Pros
+ it determines automatically th nb of clusters
+ _______
what's the idea of DBSCAN? it's based on intuitive human thinking: like eyeballing the density and splitting the data by eye;
it looks at the places with lots of density and groups the dense areas together, so the min number of points and epsilon
are what let you detect them
when it finds gaps, and circles containing only a few points, it collects those
LESS SENSITIVE TO OUTLIERS
k-means and single link merge everything together (normally), WHEREAS DBSCAN DETECTS THE OUTLIERS
!!!!!!!!!!!!!!!!!!!!!!!!!
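The idea above (core points, an epsilon radius, a min number of points, outliers as noise) can be sketched in plain Python; this is a simplified toy version, not the reference implementation:

```python
import math

def dbscan(points, eps, min_pts):
    """Minimal DBSCAN: grow clusters from core points; label the rest noise (-1)."""
    labels = [None] * len(points)
    cluster = -1

    def neighbors(i):  # all points within the eps radius (including the point itself)
        return [j for j in range(len(points))
                if math.dist(points[i], points[j]) <= eps]

    for i in range(len(points)):
        if labels[i] is not None:
            continue
        nbrs = neighbors(i)
        if len(nbrs) < min_pts:           # not a core point: tentatively noise
            labels[i] = -1
            continue
        cluster += 1                      # start a new cluster from this core point
        labels[i] = cluster
        queue = [j for j in nbrs if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster       # noise reachable from a core: border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_nbrs = neighbors(j)
            if len(j_nbrs) >= min_pts:    # j is itself a core point: keep expanding
                queue.extend(j_nbrs)
    return labels

# Two made-up dense blobs plus one far-away outlier:
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
print(dbscan(pts, eps=2, min_pts=3))  # [0, 0, 0, 1, 1, 1, -1]
```

Note how the number of clusters (2) comes out automatically and the lone point at (50, 50) is flagged as noise.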
Cons?
- a problem with the tuning of epsilon
in k-means I have K, the number of clusters, and in hierarchical we have the criteria
but here the problem is how you determine the parameter epsilon (stability approach?)
epsilon is the radius
__
error ?
classic techniques can’t do the clustering of non convex shapes
so let's see what DBSCAN does, let's go
we could set epsilon to some particular value and see whether it works out or not
smile ref
Density-based spatial clustering of applications with noise captures the insight that clusters are dense
what happens ken we decrease the nb of points? NAFTALI HARRIS visualizing dbscan clustering
(visualizing dbscan) it will find many outliers and the clusters start merging into each other
the best approach with dbscan is to analyze partition stability: you vary epsilon from near 0, you perturb, you cluster, you compare to the
reference data etc. until you have an idea about how stable the reference partition was, then you determine the suitable epsilon range
(very hard)
the simplest thing to do is to run the clustering in R or Python
___
tuning in unsupervised is harder than in supervised :(
unlike an econometric model, which is supervised, where you have the adjusted R-squared you can use
you can start with epsilon = the average of the distances between points? or the mean of the standard deviations; lots of options but
all very intuitive, e.g. take half a std dev from here and there, or take the max and set it as epsilon, but at
the end of the day these are all intuitive ideas and don't have strong theoretical bases
ML? you'd need to find a deterministic method so we know what to do and all
if you set min points = 1 it ends up working like single link? (is it always the case?)
_
the difference will show up in the definition of outliers: the more you lower the min number of points, the more possible
cluster fusion becomes, thus more of a chain effect? when you lower min points it means you're not demanding: if I
set a radius of 2 units and find a single point, I can already count it as dense and it merges them; I wasn't demanding. So it
must be neither too big nor too small; e.g. it's very hard to find a circle of 40 points within a 2-unit radius, otherwise
everything would become outliers
__________________________________________
Mathematics and Civilization? documentary, the square root of 2?
_____
Site ; [Link] DBSCAN data dbscan sample example
_
[Link] used to make random experiments repeatable
____
real data shapes IRL? dbscan is more realistic compared to the others? because real clusters aren't spherical
reality is very complex; the k-means method, and even hierarchical, are kind of fictitious and not very realistic
__________________________
spectral
distance matrix between individuals? its opposite is the similarity matrix (you could say you invert the matrix?)
there's something called the similarity matrix
to build the matrix:
cosine similarity?
you take
          indiv 1   indiv 2   ...   indiv n
indiv 1
indiv 2
...
indiv n
we compute those similarities
then we get a diagonal matrix, all zeros except the diagonal, which holds d_i, the sum of each row of the similarity matrix (row
sums), called the degree?
it does a Laplacian transformation? L = D - A, builds the new matrix, does a PCA, takes the k best components and applies
k-means to them
it computes the similarity matrix
it transforms the similarity matrix after computing it
it takes the principal components according to the eigenvalues
and applies any clustering algorithm to them
__
tutorial: spectral clustering with R (RPubs)
package kernlab
in R:
library(kernlab)
data(spirals)
plot(spirals[,1:2])
specc(spirals[,1:2], centers = 2)
derivatives of DBSCAN
there's HDBSCAN (a derivative of dbscan)
_______________________________
spectral's key point: finding irregularly shaped clusters that are not convex (strum? des .. convexes??)
____________________________________________________________________
SUPERVISED !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
ML has supervised and unsupervised
(semi-supervised is out of scope)
what we studied (clustering, PCA, association rule mining) is all UNSUPERVISED learning
in those you don't learn in order to predict something; you learn and explore to find things you didn't know
in supervised, however, you learn from the data to predict something you know about (example: the price of a stock)
when I say classification it means there's a target and there are predictors that are going to be used to explain the variability of the target
if categorical: classification
if numerical: regression
and there's binary classification and multinomial classification (when you find more than 2 classes)
like the iris data set, which has 3
you must never mix up supervised vs unsupervised, or regression vs classification
K-nearest neighbors
voting principle?
how does KNN work?
I have 2 classes, 1 and 2
I have a new observation
run KNN, say k=5: we look at which 5 observations are closest to it, whether more of them come from class 1 or class 2, and we follow the vote
there's also weighted KNN, where it depends on the distance (a derivative of KNN)
why "lazy"?
because every time you have to recompute everything from scratch
it doesn't produce a rule the way, e.g., neural networks and decision trees do
KNN is for numerical predictors, though you're allowed to have some dummy features (like everything numerical except gender),
but fundamentally it's meant for numerical predictors
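The KNN voting rule can be sketched in a few lines of plain Python (the training points and class labels below are made up):

```python
import math
from collections import Counter

def knn_predict(train, new_point, k=5):
    """Classify by majority vote among the k nearest training observations."""
    nearest = sorted(train, key=lambda xy: math.dist(xy[0], new_point))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Made-up training data: (features, class) pairs for classes 1 and 2.
train = [
    ((1.0, 1.0), 1), ((1.2, 0.8), 1), ((0.9, 1.1), 1),
    ((5.0, 5.0), 2), ((5.1, 4.9), 2), ((4.8, 5.2), 2),
]
print(knn_predict(train, (1.1, 1.0), k=5))  # 1 (3 of the 5 nearest are class 1)
```

Note the "lazy" aspect: every call re-sorts all distances from scratch, and nothing resembling a rule is stored. For KNN regression, replace the vote by the average of the k neighbors' numerical targets.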
test set
object oriented programming principle ????
KNN for classification
target : categorical
the average of the k nearest neighbors gives the prediction? that's when we use KNN for regression
what's the strong point? when you apply it to regression there are no assumptions; all you do is tune k, or look, for
each k, at (forgot what exactly, lol) the estimation it shows
KNN handles anything, skip
any relationship, and it's more realistic, because reality is very complex
it doesn't care about the econometrics assumptions, homoscedasticity, correlation and all that fuss
KNN for regression: estimate on your own
what's the limitation of KNN?
lazy means it gives you no rules, unlike other models that give rules
with each new observation it redoes all the work from the beginning
Hamming distance?
all distances work in KNN; there's an argument where you set the type of distance, but the one we work with most is
Euclidean
there are relevant and non-relevant features (there may be relevant features, or noisy ones that don't give any
relevant info or can even mislead the model), but KNN gives the same importance to all features; this is a
LIMITATION
another limitation: it's designed for numerical predictors, if they're mixed and so on
a decision tree, for example, mixes them all, numerical and categorical, no problem; it takes anything in the data
decision tree and random forest?
fel ML it doeesn’t learn ala small data sets
question : waktech we consider data set big enough for a given problem to train a ML algo
réponse : there is no rule to define the optimal data size for a given algo HOWEVER we can have some general
remarks about the data and the nature of the problem that can help us assess whether the data size is large enough
dima data size depends on the complexity level of the problem (strongly correlated) i.e ken cb hnebda men predictor
w noussel target, fama barcha routes, barcha patterns, ken nb of patterns is large (anmat eli tfaser phénomène)
(example abd tala9 faama 10000 reasons) complexity tetlaaa b patterns w ken complexity walet kbira barcha (kima
cours s3ib) donc lezmou barcha kraya. chnoua li ytalaa complexity w anmat? yebda andk nb of class labels high
(multinomial classification) (kima handwriting recognition bch tchouf edhika chneya nb high yeser)ken andk 50
barcha
for the iris data set 150 observations are enough; for handwriting 1500 aren't, so it's by feel
another thing: the number of relevant features that help understand and split the profile of each class; the more relevant features there are
(you always get a big tree), features able to explain the target data variability
sometimes the target is binary and you still need a large data set to explain and detect the patterns in the data in order to understand it
all of these increase complexity
what do relevant features mean? if the data has 1 relevant feature you get a clear shape
if there are many, the data gets tangled, with lots of weird shapes and curves, a total MESS
even in binary classification something like that can happen
and the main problem in all this is the number of relevant features
intuition is important
bahara tindra ? data science R programming
MIT? Houston? R machine learning? it's called MIT OpenCourseWare
question
answer