Session 1
stats 1, stats 2, econometrics: everything comes together here
1. **Machine Learning**: This branch focuses on developing algorithms that enable computers to learn and make predictions or
decisions without being explicitly programmed. It includes various types of learning paradigms such as supervised,
unsupervised, and reinforcement learning.
2. **Statistics**: Statistics forms the foundation of data science. It involves the collection, analysis, interpretation, and
presentation of data. Statistical methods are used to make inferences and draw conclusions from data.
3. **Data Mining**: Data mining involves the process of discovering patterns and relationships in large datasets. It often
involves techniques like clustering, association rule mining, and anomaly detection.
4. **Big Data**: This branch deals with the processing and analysis of extremely large and complex datasets that traditional
data processing techniques may struggle to handle. It includes technologies like Hadoop and Spark.
(their continuation)
data analysis? data mining? machine learning?
multivariate analysis
umbrella area = data science, it contains everything
if I calculate x̄ (the sample mean), is that a machine learning operation? = using past info to predict something unknown?
=> principle of machine learning
book : statistical learning
Data science encompasses a wide range of techniques and methods for extracting knowledge and insights from data. Some of
the main branches of data science include:
5. **Data Visualization**: Data visualization is the practice of representing data in graphical or visual formats to help users
understand the patterns, trends, and insights within the data.
6. **Natural Language Processing (NLP)**: NLP focuses on enabling computers to understand, interpret, and generate human
language. It's used in applications like sentiment analysis, machine translation, and chatbots.
7. **Time Series Analysis**: This branch is concerned with analyzing data that is ordered by time. It's commonly used in fields
like finance, economics, and environmental science to make forecasts and predictions.
8. **Deep Learning**: Deep learning is a subset of machine learning that utilizes deep neural networks with multiple layers to
learn complex features from data. It has been particularly successful in tasks like image recognition and natural language
processing.
9. **Bayesian Methods**: Bayesian statistics is a framework for statistical inference where probability distributions are assigned
to both the parameters and data. It's used for tasks like parameter estimation and uncertainty quantification.
10. **Reinforcement Learning**: This is a type of machine learning where an agent learns to make a sequence of decisions by
interacting with an environment. It's commonly used in areas like robotics and game playing.
11. **Factor Analysis and Dimensionality Reduction**: These techniques aim to reduce the complexity of data by representing it
in a lower-dimensional space while retaining as much of the relevant information as possible.
12. **Optimization**: Optimization methods are used to find the best solution to a problem from a set of possible solutions. It's
widely used in training machine learning models and solving various real-world problems.
13. **Experimental Design**: This branch focuses on designing experiments to collect data in a way that allows for valid and
reliable statistical inference.
These branches often overlap and complement each other. Data scientists draw from various disciplines to select the most
appropriate techniques and methods for a given problem or dataset. The choice of which branch to use depends on the specific
nature of the data and the objectives of the analysis.
supervised : using past available info to make rules and predict the future
machine learning
Ibn Khaldun: study history to know the future
principle of econometrics:
supervised learning: we train an algorithm on available data to predict the future
unsupervised: explore and extract knowledge and useful information from the data
e.g. techniques used to visualize data: an unsupervised learning method, for example
e.g. calculating the variance of a given variable to get an idea about the future range of the variable
factor analysis: unsupervised learning
constructing a confidence interval
computing correlation between variables is likewise unsupervised
because we're exploring the data and trying to get useful info
semi-supervised????
Why are these techniques important and why do we use them and eventually make money?
*churn analysis !!!!
Churn analysis, also known as customer attrition analysis, is the process of examining and understanding the rate at which
customers stop using a product, service, or subscription over a specific period of time. The customers who discontinue their use
are referred to as "churned" or "lost" customers.
Here's a step-by-step explanation of churn analysis:
1. **Define the Churn Event**: The first step is to define what constitutes a "churn event" for your specific business. This could
be when a customer cancels a subscription, stops using a service, or makes their last purchase. It's important to have a clear
definition to accurately track and analyze churn.
2. **Data Collection**: Gather relevant data about customer behavior, interactions, and transactions. This data might include
customer sign-up dates, usage patterns, purchase history, customer demographics, customer feedback, and any other relevant
information.
3. **Calculate Churn Rate**: The churn rate is the percentage of customers who have churned within a specific time period. It's
typically calculated using the formula:
Churn Rate = (Customers lost during the period / Total customers at the start of the period) × 100
This gives you a percentage that indicates how many customers you've lost relative to your total customer base.
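The churn rate formula can be sketched in a few lines of Python (the subscriber counts below are made-up illustrative numbers):

```python
def churn_rate(customers_at_start, customers_lost):
    """Churn rate = customers lost during the period / customers at the start, as a %."""
    return customers_lost / customers_at_start * 100

# Hypothetical example: 1,000 subscribers at the start of the month, 50 cancelled.
rate = churn_rate(customers_at_start=1000, customers_lost=50)
print(f"monthly churn rate: {rate:.1f}%")  # 5.0%
```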
4. **Segmentation**: Analyze churn based on different customer segments. For example, you might want to know if churn rates
differ by geographic location, subscription tier, or other demographic factors. This helps identify which segments are more prone
to churn and can inform targeted retention efforts.
5. **Time Period Analysis**: Churn rates may vary over time. Analyze churn on a monthly, quarterly, or yearly basis to identify
trends. Sudden spikes or drops in churn rates might indicate specific events or changes in your business that need further
investigation.
6. **Customer Feedback and Surveys**: Collect feedback from churned customers through surveys or interviews. This
qualitative data can provide valuable insights into the reasons behind their decision to leave. Understanding these reasons can
help in developing strategies to reduce churn.
7. **Root Cause Analysis**: Identify the underlying causes of churn. These could include factors like poor customer service,
product dissatisfaction, better competitive alternatives, pricing issues, or changes in customer circumstances.
8. **Predictive Churn Modeling**: Use machine learning techniques to build predictive models that estimate the likelihood of a
customer churning in the future. These models can help prioritize retention efforts for at-risk customers.
9. **Implement Retention Strategies**: Based on the insights gained from the analysis, develop and implement targeted
retention strategies. This might include improving customer service, enhancing product features, offering incentives, or
adjusting pricing.
10. **Monitor and Iterate**: Continuously monitor churn rates and the effectiveness of your retention strategies. Adjust your
approach as needed to address changing customer behavior and market conditions.
Churn analysis is a critical component of customer relationship management and can have a significant impact on a business's
profitability and long-term success. By understanding why customers leave, businesses can take proactive steps to retain
existing customers and ultimately improve their bottom line.
info from the analysis + customer behaviour => you can make an accurate model
you have to feed and train the algorithm
Orange does churn analysis (and so do banks)
predicting or forecasting automatically ⇒ SUPERVISED
*association rule mining technique (based on conditional probability)
it lets you do
market basket analysis
● text books
sql, coursera maher challouf?
python
r
factor analysis
k means clustering
agglomerative hierarchical clustering
why the right number of clusters, and how do we find it? (suitable number of clusters)
k-nearest neighbor KNN
markdown *python
*kaggle
[Link]
extreme gradient boosting algorithm
_________________________________
ch : sampling
how to get representative sample?
u can draw any sample
opinion polls (Fr: sondage)
why is sampling important?
because most of the time the population is large (and a census is expensive, money- and time-wise)
census (= the whole population)
(A census is the procedure of systematically acquiring, recording and calculating population
information about the members of a given population)
sampling (estimation of the population average ) vs census(working on each observation of the data)
in practice, a census is most of the time impossible, especially for social studies
e.g. average people's height
census (te3ded)
most of the time we work on samples and not populations (except when the population is small)
why, if we draw a large sample, can we use it to estimate the true average parameter?
law of large numbers
we have a frame: it has info on every population member (they must all have equal chances of being selected, 1/N?)
sample should inherit the population’s characteristics
(The three rules of the central limit theorem are as follows:
● The data should be sampled randomly.
● The samples should be independent of each other.
● The sample size should be sufficiently large but not exceed 10% of the
population.
)
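The idea that means of large random samples concentrate around the true population mean can be checked with a small stdlib-only simulation (the population parameters here are made up for illustration):

```python
import random
import statistics

random.seed(42)  # makes the random experiment repeatable

# Hypothetical population: 100,000 "heights" centred around 170 cm.
population = [random.gauss(170, 10) for _ in range(100_000)]
true_mean = statistics.mean(population)

# Draw many independent random samples (n = 50, well under 10% of N)
# and look at the distribution of their means.
sample_means = [
    statistics.mean(random.sample(population, 50)) for _ in range(2_000)
]

# The sample means cluster tightly around the true population mean.
print(round(true_mean, 1), round(statistics.mean(sample_means), 1))
```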
in the discipline of statistics and probability
is it fair or unfair?
I want to estimate the probability of success vs failure
*THE SIMULATION SLIDE: redo it in R
Session 2
if the sample isn’t representative, all info is useless, waste of time and money
what does representative mean?
if it inherits the population's features, properties, and characteristics
how: choose observations RANDOMLY, AT RANDOM from the population (the opposite would be BIASED sampling)
it must be drawn randomly
it must be big enough to inherit the population's characteristics
we use no rules (e.g. not everyone who shows up AT 8, because those might all be early risers with their own particular habits)
RANDOM RANDOM
2 goals: drawn at random and a large enough sample size
random: means you have the whole population in the same urn
however, in practice, we never have that 100% random case
in statistics, we say we have a frame
the first problem: getting the frame (if one exists, it's rich in information) (then you can divide the population into
homogeneous groups)
probability sampling techniques
1. simple random sampling: big urn containing all X (with or without replacement)
#ref: Bernoulli distribution
why expensive?
2 main approaches in sampling
part 1: probability sampling techniques (here we have the frame: it contains people's contacts, but can also hold additional info that
can be used later to stratify the population into many homogeneous groups, and that can be useful)
part 2: non-probability sampling techniques (most famous and practical: the quota technique) (used if you don't have a proper frame)
(see the slides for the different sampling techniques)
when the population is very large and the sample small, it doesn't matter whether we draw with or without replacement
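With/without replacement can be illustrated with Python's standard library: `random.sample` draws without replacement, `random.choices` with replacement (the frame below is a made-up list of member IDs):

```python
import random

random.seed(1)  # reproducible draws

# Hypothetical frame: every population member with an ID (equal chance 1/N each).
frame = list(range(1, 101))  # N = 100

without_replacement = random.sample(frame, 10)  # each member drawn at most once
with_replacement = random.choices(frame, k=10)  # a member may be drawn twice

# When N is much larger than n, both behave almost identically:
# repeats under replacement become very unlikely.
print(without_replacement)
print(with_replacement)
```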
in a case of
________________________________________________________________________________________________
Post midterms session
supervised: using historical data to forecast and make rules; these are for estimations, there is a target; that covers all of econometrics,
time series analysis, classification models
unsupervised: when there's no target, nothing you want to estimate through it, e.g. PCA, clustering, association rule mining
CLUSTERING
there are 2 main concepts
classification (French has two terms, "classification automatique / non supervisée" vs "classification supervisée"; English has just the one word)
(this one is supervised, with a categorical target, e.g.: will a student pass or not) vs clustering: the data is unlabeled, I don't have a categorical
target to predict, a black box? the main objective is to separate/divide the data into homogeneous groups
textbook: Machine Learning with R, Brett Lantz (2019 edition)
Hands-On Machine Learning with Python
the most famous clustering algorithm in machine learning: the K-means algorithm
there are 3 versions?
K-means is a method used to group or cluster similar data points into distinct groups. The "K" in
K-means refers to the number of clusters you want to create. Here's how it works:
Initialization: Start by randomly placing K points in your data space. These points represent
the initial guesses for the centers of your clusters.
Assignment: Assign each data point to the cluster whose center is closest to it. The
"closeness" is typically measured using the Euclidean distance.
Update: Recalculate the cluster centers by finding the mean of all the data points assigned to
each cluster. These means become the new cluster centers.
Repeat: Repeat steps 2 and 3 until the cluster assignments and the cluster centers no longer
change significantly.
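The four steps above can be sketched in plain Python; this is a toy Lloyd's-algorithm version on made-up 2-D points, not a production implementation:

```python
import math
import random

def kmeans(points, k, n_iter=100, seed=0):
    """Minimal K-means: init, assign, update, repeat until stable."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)               # 1. random initial centers
    clusters = [[] for _ in range(k)]
    for _ in range(n_iter):
        clusters = [[] for _ in range(k)]
        for p in points:                          # 2. assign to the closest center
            i = min(range(k), key=lambda j: math.dist(p, centers[j]))
            clusters[i].append(p)
        new_centers = [                            # 3. update: mean of each cluster
            tuple(sum(x) / len(c) for x in zip(*c)) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
        if new_centers == centers:                # 4. stop when nothing changes
            break
        centers = new_centers
    return centers, clusters

# Two obvious, made-up groups of 2-D points:
data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, clusters = kmeans(data, k=2)
print(sorted(len(c) for c in clusters))  # [3, 3]
```

In R the same idea is `kmeans(data, centers = 2, nstart = 10)`, where `nstart` reruns the random initialization and keeps the run with the lowest within-cluster sum of squares.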
with n_start, the algorithm already tries many sets of starting points and keeps the best ones (the lowest inertia)
k-means only works with numerical data
when you standardize the data you might be able to do something related to categorical features, but we won't dig into that because
it isn't designed for it
[Link]
python :
[Link]
spectral clustering algorithm
specc(data, k) works with non-convex clusters
_____
it starts (lol) from the best points, the ones with the lowest sum of squares?
it doesn't pick just any pair; it picks the best: it re-picks many random starts, tests them all, and keeps the one with the lowest
inertia (in Python or R the argument is called n_start and you tell it how many starting points)
it has an urn with 1000 balls
each one holds the coordinates of an individual
you tell it k = 3, for example; it takes 3 at random, records them, puts them back, and keeps repeating; in the space, it looks
at the points it chose at random and computes the objective function for every choice, for all n_start choices (the argument), say
you set it to 10: it TAKES THE CHOICE WHOSE CLUSTERS HAVE THE LOWEST SUM OF SQUARES
+++++++
OK, besides k-means are there more? k-medians?
____________________________________
Hierarchical Clustering
1. distance between sets of individuals
● complete link criterion
d(G1,G2) = d(I8,I5) (the two farthest individuals)
● average link criterion (WRONG NAME: IT'S CALLED THE CENTROID CRITERION!!!)
d(G1,G2) = d(C1,C2) (C1 and C2 are the cluster centers, supposed to sit in the middle of each group)
● single link criterion
d(G1,G2) = d(I10,I3) (in the end, the closest points to each other, the opposite of the first)
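The three linkage criteria can be written directly from their definitions; a small sketch with made-up 2-D groups:

```python
import math

def complete_link(g1, g2):
    """Distance between the two FARTHEST individuals of the two groups."""
    return max(math.dist(a, b) for a in g1 for b in g2)

def single_link(g1, g2):
    """Distance between the two CLOSEST individuals of the two groups."""
    return min(math.dist(a, b) for a in g1 for b in g2)

def centroid_link(g1, g2):
    """Distance between the group centers C1 and C2."""
    c1 = tuple(sum(x) / len(g1) for x in zip(*g1))
    c2 = tuple(sum(x) / len(g2) for x in zip(*g2))
    return math.dist(c1, c2)

# Hypothetical groups G1 and G2:
G1 = [(0, 0), (1, 0)]
G2 = [(4, 0), (6, 0)]
print(single_link(G1, G2))    # 3.0  (closest pair: (1,0)-(4,0))
print(complete_link(G1, G2))  # 6.0  (farthest pair: (0,0)-(6,0))
print(centroid_link(G1, G2))  # 4.5  (centers (0.5,0) and (5,0))
```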
by using the single link criterion
individuals 3 and 4 come out as the closest pair; do we count them as clusters, each one on its own?
we project them onto the X axis
that's the thing he did (check the pic on the phone, brain ref)
clustering dendrogram
cutree function
in statistics, Ward's method is a criterion applied in hierarchical cluster analysis. Ward's minimum variance method is a special
case of the objective function approach originally presented by Joe H. Ward, Jr
the pyramids method?
the pyramids method, Edwin Diday
_____
mubarki tips 101 xDDDD
___________________
Andrew (can't remember what), Coursera deep learning
____
text mining sentiment analysis?
__________________
back
how do I know which method is better?
entangled classes (Fr: classes enchevêtrées)
method “complete” we take the 2 farthest individuals
hclust complete
____
Euclidean distance
Manhattan distance
tuning
site :
[Link]/k-means-clustering-concepts-and-implementation-in-r-for-data-science-32cae6a3ceba
___
so good morning everyone, we've already covered clustering
we covered clustering
clustering is an unsupervised learning thing
he introduced clustering again (lol)
___
agglomerative hierarchical approach?
dendrogram
why is clustering data important and difficult? (lol) :(
unsupervised is harder than supervised, mainly because it's hard to evaluate the results in the unsupervised learning area
ch score?
paper?
help?
for the number of clusters we use either distances between clusters or within them
criteria
1. homogeneity (compactness): the traditional thing, WSS: inertia/WSS; to compare the compactness of partitions you use
WSS as a criterion
2. isolation: compute the distance between clusters (we have a global center and class centers): BSS (between-cluster
sum of squared distances) (isolation forest method?)
Cooper 1985? clustering (k-means)
elbow criterion? highest jump? beginner-level tricks, weak, not advanced
CH index: advanced, what's the formula?
stability approach? evaluation (lol)
it does data perturbation (lol): e.g. remove 20% of the data and redo the clustering, or do individual
replication, or add white noise :o so the original picture gets shaken up a bit, then compare it to the reference partition;
if the partition you find with, say, 2 clusters turns out more stable than the one with 4 (or the opposite), that's the "advanced level", my friend
what is the stability approach: if you have clusters from a random partition (i.e. clustering random data), then automatically, if the
data has no real structure, each perturbation gives a different k, sometimes 1, sometimes 2? when you perturb, the clusters merge into each other and
the reference partition gets lost during the process
adjusted Rand index? (Fr: l'indice de Rand ajusté)
(PCA? Bartlett's sphericity test, something like that)
PCA is only good when there's a sharp drop and there are big differences and gaps in the eigenvalues
you must always, always evaluate: do a model performance analysis to make sure the model is good (lol)
clustering without validation is worthless (lol), it's good for nothing :clown:
clustering validation has 2 parts
1. what is the correct number of clusters: CH index, stability approach (perturbation, or resampling with white noise)?
2. which method is correct (complete link? single link? k-means?) which one is best?
if a partition gives you interpretable, understandable results, this indicates the split makes sense and the differences are clear, so
that's a good indicator
clustering text? you take many pdf files (data / web scraping?)
you divide them into many topics (classification)
word embedding!!!! *packages (word2vec? BERT? GloVe?); R is lagging now on neural networks (R is fine for everything except
neural networks and deep learning)
keywords?
text mining method? I have a text
we take the frequent words but not the stop words (like "the")
the words become the features of a table
with the frequencies of the words
in Python this is called an n-gram
(DTM = document-term matrix)
coded text? (lol)
        w1  w2  w3  w4
text 1
text 2
text 3
text 4
NUMERICAL features, and the topics come out on their own :o
word cloud?? :o
nlp
you apply k-means to the table; those clusters are the topics
contextualization of the words ;)
if the target is numerical, use regression
(regression tree, random forest, neural networks)
econometrics suits proper ("classic") data sets; machine learning is better, right?
word embeddings are used for sentiment analysis (feelings); to compute similarity between words there's cosine similarity?
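The word-frequency table described above (a document-term matrix) can be sketched in plain Python; the stop-word list and the mini corpus are made-up examples:

```python
from collections import Counter

STOP_WORDS = {"the", "a", "of", "and", "is"}  # tiny illustrative stop list

def document_term_matrix(texts):
    """Build a DTM: rows = texts, columns = word frequencies."""
    tokenized = [
        [w for w in t.lower().split() if w not in STOP_WORDS] for t in texts
    ]
    vocab = sorted({w for doc in tokenized for w in doc})  # the feature columns
    rows = [[Counter(doc)[w] for w in vocab] for doc in tokenized]
    return vocab, rows

# Made-up mini corpus:
texts = ["the cat sat", "the cat and the dog", "stocks and markets"]
vocab, dtm = document_term_matrix(texts)
print(vocab)  # ['cat', 'dog', 'markets', 'sat', 'stocks']
print(dtm)    # [[1, 0, 0, 1, 0], [1, 1, 0, 0, 0], [0, 0, 1, 0, 1]]
```

k-means applied to these numerical rows would then group the texts by topic, as described above.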
___________________________________________________
MARKET BASKET ANALYSIS (unsupervised learning) / *association rule mining
it started in the 90s
he noticed that there are associations between products
weather forecasting
association rule mining is very important and yet not used much (lol) :( underrated, like me
Association rule
A—>B
wala (A,B)-->C
a pair implies something, or one thing implies another
P(B|A) = P(A ∩ B) / P(A) ⇒ confidence index
= count(A and B) / count(A) (you can also turn the count into a support)
(think of the words-and-texts data set, but with items instead of words and transactions instead of texts)
transaction method
the items are the products
____________________
purchased
cnn ? networks?
PYTHON GRAPH GALLERY
R GRAPH GALLERY
lift(X ⇒ Y) = supp(X ∩ Y) / (supp(X) × supp(Y))
correlation is a standardized covariance
______
if lift = 1, the probabilities of X and Y occurring are independent; no rule can be drawn
if lift > 1, it tells us the degree to which those occurrences are dependent on one another; useful for predicting the
consequent in future data sets
if lift < 1, it tells us the items are substitutes for each other, so the presence of one item has a negative effect on the other
the best data visualization for association rule mining is done with R
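The support/confidence/lift definitions above can be sketched in plain Python (the baskets are made-up transactions; items play the role of products):

```python
def support(transactions, items):
    """Fraction of transactions containing all the given items."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)

def confidence(transactions, a, b):
    """P(B|A) = support(A and B) / support(A)."""
    return support(transactions, a | b) / support(transactions, a)

def lift(transactions, a, b):
    """lift(A => B) = support(A and B) / (support(A) * support(B))."""
    return support(transactions, a | b) / (
        support(transactions, a) * support(transactions, b)
    )

# Hypothetical baskets (one set of items per transaction):
baskets = [
    {"bread", "milk"},
    {"bread", "milk", "eggs"},
    {"bread"},
    {"milk"},
]
print(confidence(baskets, {"bread"}, {"milk"}))  # 2/3 of bread buyers also buy milk
print(lift(baskets, {"bread"}, {"milk"}))        # ~0.89 < 1: slightly substitutes
```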
___________________________________
DBSCAN?
bandwidth?
it chooses any point from the data
for example, a min number of points of 3?
the core point is the center of a circle
every circle must contain at least 3 points
basically, picture me having many points
and there's a circle you move around
at each spot, the points only light up, meaning a cluster, if there are at least those 3, for example
Pros
+ it determines automatically th nb of clusters
+ _______
what's the idea of DBSCAN? it's based on intuitive human thinking: like eyeballing the density and splitting the data by eye;
it looks at the places with lots of density and groups the dense areas together, so the min number of points and epsilon
are what let you detect them
when it finds gaps, and circles containing only a few points, it collects those
LESS SENSITIVE TO OUTLIERS
k-means and single link merge everything together (normally), WHEREAS DBSCAN DETECTS THE OUTLIERS
!!!!!!!!!!!!!!!!!!!!!!!!!
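The idea above (core points, an epsilon radius, a min number of points, outliers as noise) can be sketched in plain Python; this is a simplified toy version, not the reference implementation:

```python
import math

def dbscan(points, eps, min_pts):
    """Minimal DBSCAN: grow clusters from core points; label the rest noise (-1)."""
    labels = [None] * len(points)
    cluster = -1

    def neighbors(i):  # all points within the eps radius (including the point itself)
        return [j for j in range(len(points))
                if math.dist(points[i], points[j]) <= eps]

    for i in range(len(points)):
        if labels[i] is not None:
            continue
        nbrs = neighbors(i)
        if len(nbrs) < min_pts:           # not a core point: tentatively noise
            labels[i] = -1
            continue
        cluster += 1                      # start a new cluster from this core point
        labels[i] = cluster
        queue = [j for j in nbrs if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster       # noise reachable from a core: border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_nbrs = neighbors(j)
            if len(j_nbrs) >= min_pts:    # j is itself a core point: keep expanding
                queue.extend(j_nbrs)
    return labels

# Two made-up dense blobs plus one far-away outlier:
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
print(dbscan(pts, eps=2, min_pts=3))  # [0, 0, 0, 1, 1, 1, -1]
```

Note how the number of clusters (2) comes out automatically and the lone point at (50, 50) is flagged as noise.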
Cons?
- a problem with the tuning of epsilon
in k-means I have K, the number of clusters, and in hierarchical we have the criteria
but here the problem is how you determine the parameter epsilon (stability approach?)
epsilon is the radius
__
error ?
classic techniques can’t do the clustering of non convex shapes
so let's see what DBSCAN does, let's go
we could set epsilon to some particular value and see whether it works out or not
smile ref
Density-based spatial clustering of applications with noise captures the insight that clusters are dense
what happens ken we decrease the nb of points? NAFTALI HARRIS visualizing dbscan clustering
(visualizing dbscan) it will find many outliers and the clusters start merging into each other
the best approach with dbscan is to analyze partition stability: you vary epsilon from near 0, you perturb, you cluster, you compare to the
reference data etc. until you have an idea about how stable the reference partition was, then you determine the suitable epsilon range
(very hard)
the simplest thing to do is to run the clustering in R or Python
___
tuning in unsupervised is harder than in supervised :(
unlike an econometric model, which is supervised, where you have the adjusted R-squared you can use
you can start with epsilon = the average of the distances between points? or the mean of the standard deviations; lots of options but
all very intuitive, e.g. take half a std dev from here and there, or take the max and set it as epsilon, but at
the end of the day these are all intuitive ideas and don't have strong theoretical bases
ML? you'd need to find a deterministic method so we know what to do and all
if you set min points = 1 it ends up working like single link? (is it always the case?)
_
the difference will show up in the definition of outliers: the more you lower the min number of points, the more possible
cluster fusion becomes, thus more of a chain effect? when you lower min points it means you're not demanding: if I
set a radius of 2 units and find a single point, I can already count it as dense and it merges them; I wasn't demanding. So it
must be neither too big nor too small; e.g. it's very hard to find a circle of 40 points within a 2-unit radius, otherwise
everything would become outliers
__________________________________________
Mathematics and Civilization? documentary, the square root of 2?
_____
Site ; [Link] DBSCAN data dbscan sample example
_
[Link] used to make random experiments repeatable
____
real data shapes IRL? dbscan is more realistic compared to the others? because real clusters aren't spherical
reality is very complex; the k-means method, and even hierarchical, are kind of fictitious and not very realistic
__________________________
spectral
distance matrix between individuals? its opposite is the similarity matrix (you could say you invert the matrix?)
there's something called the similarity matrix
to build the matrix:
cosine similarity?
you take
          indiv 1   indiv 2   ...   indiv n
indiv 1
indiv 2
...
indiv n
we compute those similarities
then we get a diagonal matrix, all zeros except the diagonal, which holds d_i, the sum of each row of the similarity matrix (row
sums), called the degree?
it does a Laplacian transformation? L = D - A, builds the new matrix, does a PCA, takes the k best components and applies
k-means to them
it computes the similarity matrix
it transforms the similarity matrix after computing it
it takes the principal components according to the eigenvalues
and applies any clustering algorithm to them
__
tutorial: spectral clustering with R (RPubs)
package kernlab
in R:
library(kernlab)
data(spirals)
plot(spirals[,1:2])
specc(spirals[,1:2], centers = 2)
derivatives of DBSCAN
there's HDBSCAN (a derivative of dbscan)
_______________________________
spectral's key point: finding irregularly shaped clusters that are not convex (strum? des .. convexes??)
____________________________________________________________________
SUPERVISED !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
ML has supervised and unsupervised
(semi-supervised is out of scope)
what we studied (clustering, PCA, association rule mining) is all UNSUPERVISED learning
in those you don't learn in order to predict something; you learn and explore to find things you didn't know
in supervised, however, you learn from the data to predict something you know about (example: the price of a stock)
when I say classification it means there's a target and there are predictors that are going to be used to explain the variability of the target
if categorical: classification
if numerical: regression
and there's binary classification and multinomial classification (when you find more than 2 classes)
like the iris data set, which has 3
you must never mix up supervised vs unsupervised, or regression vs classification
K-nearest neighbors
voting principle?
how does KNN work?
I have 2 classes, 1 and 2
I have a new observation
run KNN, say k=5: we look at which 5 observations are closest to it, whether more of them come from class 1 or class 2, and we follow the vote
there's also weighted KNN, where it depends on the distance (a derivative of KNN)
why "lazy"?
because every time you have to recompute everything from scratch
it doesn't produce a rule the way, e.g., neural networks and decision trees do
KNN is for numerical predictors, though you're allowed to have some dummy features (like everything numerical except gender),
but fundamentally it's meant for numerical predictors
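The KNN voting rule can be sketched in a few lines of plain Python (the training points and class labels below are made up):

```python
import math
from collections import Counter

def knn_predict(train, new_point, k=5):
    """Classify by majority vote among the k nearest training observations."""
    nearest = sorted(train, key=lambda xy: math.dist(xy[0], new_point))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Made-up training data: (features, class) pairs for classes 1 and 2.
train = [
    ((1.0, 1.0), 1), ((1.2, 0.8), 1), ((0.9, 1.1), 1),
    ((5.0, 5.0), 2), ((5.1, 4.9), 2), ((4.8, 5.2), 2),
]
print(knn_predict(train, (1.1, 1.0), k=5))  # 1 (3 of the 5 nearest are class 1)
```

Note the "lazy" aspect: every call re-sorts all distances from scratch, and nothing resembling a rule is stored. For KNN regression, replace the vote by the average of the k neighbors' numerical targets.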
test set
object oriented programming principle ????
KNN for classification
target : categorical
the average of the k nearest neighbors gives the prediction? that's when we use KNN for regression
what's the strong point? when you apply it to regression there are no assumptions; all you do is tune k, or look, for
each k, at (forgot what exactly, lol) the estimation it shows
KNN handles anything, skip
any relationship, and it's more realistic, because reality is very complex
it doesn't care about the econometrics assumptions, homoscedasticity, correlation and all that fuss
KNN for regression: estimate on your own
what's the limitation of KNN?
lazy means it gives you no rules, unlike other models that give rules
with each new observation it redoes all the work from the beginning
Hamming distance?
all distances work in KNN; there's an argument where you set the type of distance, but the one we work with most is
Euclidean
there are relevant and non-relevant features (there may be relevant features, or noisy ones that don't give any
relevant info or can even mislead the model), but KNN gives the same importance to all features; this is a
LIMITATION
another limitation: it's designed for numerical predictors, if they're mixed and so on
a decision tree, for example, mixes them all, numerical and categorical, no problem; it takes anything in the data
decision tree and random forest?
fel ML it doeesn’t learn ala small data sets
question : waktech we consider data set big enough for a given problem to train a ML algo
réponse : there is no rule to define the optimal data size for a given algo HOWEVER we can have some general
remarks about the data and the nature of the problem that can help us assess whether the data size is large enough
dima data size depends on the complexity level of the problem (strongly correlated) i.e ken cb hnebda men predictor
w noussel target, fama barcha routes, barcha patterns, ken nb of patterns is large (anmat eli tfaser phénomène)
(example abd tala9 faama 10000 reasons) complexity tetlaaa b patterns w ken complexity walet kbira barcha (kima
cours s3ib) donc lezmou barcha kraya. chnoua li ytalaa complexity w anmat? yebda andk nb of class labels high
(multinomial classification) (kima handwriting recognition bch tchouf edhika chneya nb high yeser)ken andk 50
barcha
for the iris data set 150 observations are enough; for handwriting 1500 aren't, so it's by feel
another thing: the number of relevant features that help understand and split the profile of each class; the more relevant features there are
(you always get a big tree), features able to explain the target data variability
sometimes the target is binary and you still need a large data set to explain and detect the patterns in the data in order to understand it
all of these increase complexity
what do relevant features mean? if the data has 1 relevant feature you get a clear shape
if there are many, the data gets tangled, with lots of weird shapes and curves, a total MESS
even in binary classification something like that can happen
and the main problem in all this is the number of relevant features
intuition is important
bahara tindra ? data science R programming
MIT? Houston? R machine learning? it's called MIT OpenCourseWare
question
answer