R Random Forest Tutorial with Example

What is Random Forest in R?

Nasumiฤne ลกume temelje se na jednostavnoj ideji: 'mudrosti gomile'. Zbir rezultata viลกestrukih prediktora daje bolje predviฤ‘anje od najboljeg pojedinaฤnog prediktora. Skupina prediktora naziva se an ansambl. Dakle, ova tehnika se zove Uฤenje ansambla.

In a previous tutorial, you learned how to use decision trees to make a binary prediction. To improve on that technique, we can train a group of decision tree classifiers, each on a different random subset of the training set. To make a prediction, we collect the predictions of all the individual trees, then predict the class that gets the most votes. This technique is called Random Forest.
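As a toy illustration of majority voting (hard-coded hypothetical votes, not the Titanic data used below), the ensemble idea can be sketched in base R:

```r
# Toy ensemble voting: each row holds one "tree's" class votes for three
# observations; the ensemble predicts the class with the most votes per column.
votes <- matrix(c("Yes", "Yes", "No",   # tree 1
                  "Yes", "No",  "No",   # tree 2
                  "Yes", "No",  "Yes"), # tree 3
                nrow = 3, byrow = TRUE)
majority_vote <- apply(votes, 2, function(v) names(which.max(table(v))))
majority_vote
# [1] "Yes" "No"  "No"
```

Even when individual trees disagree (as in the second and third columns), the majority vote is usually more stable than any single tree.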

Step 1) Import the data

To make sure you have the same dataset as in the decision trees tutorial, the training set and the test set are stored online. You can import them without making any changes.

library(dplyr)
data_train <- read.csv("https://raw.githubusercontent.com/guru99-edu/R-Programming/master/train.csv")
glimpse(data_train)
data_test <- read.csv("https://raw.githubusercontent.com/guru99-edu/R-Programming/master/test.csv") 
glimpse(data_test)

Step 2) Train the model

One way to evaluate the performance of a model is to train it on several different smaller datasets and evaluate it on the remaining, held-out data. This is called K-fold cross-validation. R has a function to randomly split a dataset into folds of nearly the same size. For example, with k = 10, the dataset is divided into ten folds; the model is trained on nine of them and evaluated on the remaining fold, and this procedure is repeated until every fold has served as the test set. This technique is widely used for model selection, especially when the model has parameters to tune.
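A minimal base-R sketch of the fold construction (caret's trainControl(method = "cv") handles this for you; the toy dataset size n is an assumption):

```r
# Split n shuffled row indices into k folds of (nearly) equal size.
set.seed(1234)
n <- 20   # toy number of observations (assumption for illustration)
k <- 10
folds <- split(sample(n), rep(1:k, length.out = n))
length(folds)           # k folds
sapply(folds, length)   # here each fold holds n/k = 2 row indices
```

Each fold in turn is held out as the test set while the model trains on the other k - 1 folds.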

Now that we have a way to evaluate our model, we need to figure out how to choose the parameters that generalize the data best.

Nasumiฤna ลกuma odabire nasumiฤni podskup znaฤajki i gradi mnoga stabla odluฤivanja. Model izraฤunava prosjek svih predviฤ‘anja stabala odluka.

Sluฤajna ลกuma ima neke parametre koji se mogu promijeniti kako bi se poboljลกala generalizacija predviฤ‘anja. Koristit ฤ‡ete funkciju RandomForest() za obuku modela.

The syntax for randomForest is

randomForest(formula, ntree=n, mtry=FALSE, maxnodes = NULL)
Arguments:
- formula: Formula of the fitted model
- ntree: Number of trees in the forest
- mtry: Number of candidate variables drawn to feed the algorithm. By default, it is the square root of the number of columns.
- maxnodes: Set the maximum number of terminal nodes in the forest
- importance=TRUE: Whether the importance of the independent variables should be assessed

Note: A random forest can be trained with more parameters. You can refer to the vignette to see the different parameters.
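For classification, randomForest's default mtry is the floored square root of the number of predictors, which you can check with one line of base R (p = 7 matches the dataset used below):

```r
# Default mtry for classification: floor(sqrt(number of predictors)).
p <- 7
floor(sqrt(p))
# [1] 2
```

This explains why mtry = 2 appears as the default candidate later in the tutorial.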

Tuning a model is very tedious work. There are many possible combinations of parameters, and you do not necessarily have the time to try them all. A good alternative is to let the machine find the best combination for you. Two methods are available:

  • Nasumiฤno pretraลพivanje
  • Mreลพno pretraลพivanje

We will define both methods, but during the tutorial we will train the model using grid search.

Grid search definition

The grid search method is simple: the model will be evaluated over all the combinations you pass in the function, using cross-validation.

For example, say you want to try the model with 10, 20 and 30 trees, and each number of trees will be tested with mtry equal to 1, 2, 3, 4 and 5. The machine will then test 15 different models:

    .mtry ntrees
 1      1     10
 2      2     10
 3      3     10
 4      4     10
 5      5     10
 6      1     20
 7      2     20
 8      3     20
 9      4     20
 10     5     20
 11     1     30
 12     2     30
 13     3     30
 14     4     30
 15     5     30	

The algorithm will evaluate:

randomForest(formula, ntree=10, mtry=1)
randomForest(formula, ntree=10, mtry=2)
randomForest(formula, ntree=10, mtry=3)
...
randomForest(formula, ntree=20, mtry=1)
...
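The 15 combinations listed above can be generated with base R's expand.grid(), which is also the kind of data frame caret's tuneGrid argument expects:

```r
# Build every combination of mtry and ntrees; mtry varies fastest,
# matching the order of the table above.
grid <- expand.grid(.mtry = 1:5, ntrees = c(10, 20, 30))
nrow(grid)     # 15 models to evaluate
head(grid, 3)
```

Grid search then trains and cross-validates one model per row of this data frame.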

Each model is trained and evaluated with cross-validation. One drawback of grid search is the number of experiments: it can easily explode when the number of combinations is large. To overcome this issue, you can use random search.

Definicija nasumiฤnog pretraลพivanja

The big difference between random search and grid search is that random search will not evaluate all the combinations of hyperparameters in the search space. Instead, it randomly selects a combination at each iteration. The advantage is a lower computational cost.
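A minimal sketch of the idea over the same hypothetical grid (in caret you get this behavior by setting search = "random" in trainControl()):

```r
# Random search: sample a few combinations instead of evaluating all 15.
set.seed(42)
grid <- expand.grid(mtry = 1:5, ntree = c(10, 20, 30))
picked <- grid[sample(nrow(grid), 5), ]   # try only 5 random combinations
nrow(picked)
# [1] 5
```

Evaluating 5 models instead of 15 cuts the cost by two thirds, at the risk of missing the exact optimum.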

Set the control parameters

You will proceed as follows to construct and evaluate the model:

  • Evaluate the model with the default settings
  • Find the best number of mtry
  • Find the best number of maxnodes
  • Find the best number of ntrees
  • Evaluate the model on the test dataset

Before you begin exploring the parameters, you need to install two libraries:

  • caret: the R machine learning library. If you installed R with r-essential, it is already included.
  • e1071: an R machine learning library.

You can import them along with randomForest:

library(randomForest)
library(caret)
library(e1071)

Tvorniฤke postavke

K-fold cross-validation is controlled by the trainControl() function.

trainControl(method = "cv", number = n, search = "grid")
Arguments
- method = "cv": The method used to resample the dataset
- number = n: Number of folds to create
- search = "grid": Use the grid search method. For the randomized method, use "random"
Note: You can refer to the vignette to see the other arguments of the function.

You can try running the model with the default parameters and see the accuracy score.

Note: You will use the same controls throughout the tutorial.

# Define the control
trControl <- trainControl(method = "cv",
    number = 10,
    search = "grid")

To evaluate your model, you will use the caret library. The library has one function called train() to evaluate almost any machine learning algorithm. Put differently, you can use this function to train other algorithms as well.

The basic syntax is:

train(formula, df, method = "rf", metric= "Accuracy", trControl = trainControl(), tuneGrid = NULL)
Arguments
- `formula`: Define the formula of the algorithm
- `method`: Define which model to train. Note, at the end of the tutorial, there is a list of all the models that can be trained
- `metric` = "Accuracy": Define how to select the optimal model
- `trControl = trainControl()`: Define the control parameters
- `tuneGrid = NULL`: Provide a data frame with all the possible combinations of tuning parameters

Let's try to build the model with the default values.

set.seed(1234)
# Run the model
rf_default <- train(survived~.,
    data = data_train,
    method = "rf",
    metric = "Accuracy",
    trControl = trControl)
# Print the results
print(rf_default)

Code explanation

  • trainControl(method="cv", number=10, search="grid"): Evaluate the model with a grid search over 10 folds
  • train(...): Train a random forest model. The best model is selected with the accuracy measure.

Output:

## Random Forest 
## 
## 836 samples
##   7 predictor
##   2 classes: 'No', 'Yes' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 753, 752, 753, 752, 752, 752, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##    2    0.7919248  0.5536486
##    6    0.7811245  0.5391611
##   10    0.7572002  0.4939620
## 
## Accuracy was used to select the optimal model using  the largest value.
## The final value used for the model was mtry = 2.

The algorithm uses 500 trees and tested three different values of mtry: 2, 6, 10.

Konaฤna vrijednost koriลกtena za model bila je mtry = 2 s toฤnoลกฤ‡u od 0.78. Pokuลกajmo postiฤ‡i veฤ‡i rezultat.

Step 2) Search the best mtry

You can test the model with values of mtry from 1 to 10.

set.seed(1234)
tuneGrid <- expand.grid(.mtry = c(1: 10))
rf_mtry <- train(survived~.,
    data = data_train,
    method = "rf",
    metric = "Accuracy",
    tuneGrid = tuneGrid,
    trControl = trControl,
    importance = TRUE,
    nodesize = 14,
    ntree = 300)
print(rf_mtry)

Code explanation

  • tuneGrid <- expand.grid(.mtry = c(1:10)): Construct a vector with the values 1 to 10

Konaฤna vrijednost koriลกtena za model bila je mtry = 4.

Output:

## Random Forest 
## 
## 836 samples
##   7 predictor
##   2 classes: 'No', 'Yes' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 753, 752, 753, 752, 752, 752, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##    1    0.7572576  0.4647368
##    2    0.7979346  0.5662364
##    3    0.8075158  0.5884815
##    4    0.8110729  0.5970664
##    5    0.8074727  0.5900030
##    6    0.8099111  0.5949342
##    7    0.8050918  0.5866415
##    8    0.8050918  0.5855399
##    9    0.8050631  0.5855035
##   10    0.7978916  0.5707336
## 
## Accuracy was used to select the optimal model using  the largest value.
## The final value used for the model was mtry = 4.

The best value of mtry is stored in:

rf_mtry$bestTune$mtry

You can store it and use it when you need to tune the other parameters.

max(rf_mtry$results$Accuracy)

Output:

## [1] 0.8110729

best_mtry <- rf_mtry$bestTune$mtry
best_mtry

Output:

## [1] 4

Step 3) Search the best maxnodes

You need to create a loop to evaluate the different values of maxnodes. In the following code, you will:

  • Create a list
  • Create a variable with the best value of the parameter mtry (mandatory)
  • Create the loop
  • Store the current value of maxnodes
  • Summarize the results

store_maxnode <- list()
tuneGrid <- expand.grid(.mtry = best_mtry)
for (maxnodes in c(5: 15)) {
    set.seed(1234)
    rf_maxnode <- train(survived~.,
        data = data_train,
        method = "rf",
        metric = "Accuracy",
        tuneGrid = tuneGrid,
        trControl = trControl,
        importance = TRUE,
        nodesize = 14,
        maxnodes = maxnodes,
        ntree = 300)
    current_iteration <- toString(maxnodes)
    store_maxnode[[current_iteration]] <- rf_maxnode
}
results_mtry <- resamples(store_maxnode)
summary(results_mtry)

Code explanation:

  • store_maxnode <- list(): The results of the models will be stored in this list
  • expand.grid(.mtry = best_mtry): Use the best value of mtry
  • for (maxnodes in c(5:15)) { ... }: Compute the model with values of maxnodes starting from 5 up to 15
  • maxnodes = maxnodes: For each iteration, maxnodes is equal to the current value of maxnodes, i.e. 5, 6, 7, ...
  • current_iteration <- toString(maxnodes): Store the value of maxnodes as a string variable
  • store_maxnode[[current_iteration]] <- rf_maxnode: Save the result of the model in the list
  • resamples(store_maxnode): Collect the results of the models
  • summary(results_mtry): Print the summary of all the combinations

Output:

## 
## Call:
## summary.resamples(object = results_mtry)
## 
## Models: 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15 
## Number of resamples: 10 
## 
## Accuracy 
##         Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## 5  0.6785714 0.7529762 0.7903758 0.7799771 0.8168388 0.8433735    0
## 6  0.6904762 0.7648810 0.7784710 0.7811962 0.8125000 0.8313253    0
## 7  0.6904762 0.7619048 0.7738095 0.7788009 0.8102410 0.8333333    0
## 8  0.6904762 0.7627295 0.7844234 0.7847820 0.8184524 0.8433735    0
## 9  0.7261905 0.7747418 0.8083764 0.7955250 0.8258749 0.8333333    0
## 10 0.6904762 0.7837780 0.7904475 0.7895869 0.8214286 0.8433735    0
## 11 0.7023810 0.7791523 0.8024240 0.7943775 0.8184524 0.8433735    0
## 12 0.7380952 0.7910929 0.8144005 0.8051205 0.8288511 0.8452381    0
## 13 0.7142857 0.8005952 0.8192771 0.8075158 0.8403614 0.8452381    0
## 14 0.7380952 0.7941050 0.8203528 0.8098967 0.8403614 0.8452381    0
## 15 0.7142857 0.8000215 0.8203528 0.8075301 0.8378873 0.8554217    0
## 
## Kappa 
##         Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## 5  0.3297872 0.4640436 0.5459706 0.5270773 0.6068751 0.6717371    0
## 6  0.3576471 0.4981484 0.5248805 0.5366310 0.6031287 0.6480921    0
## 7  0.3576471 0.4927448 0.5192771 0.5297159 0.5996437 0.6508314    0
## 8  0.3576471 0.4848320 0.5408159 0.5427127 0.6200253 0.6717371    0
## 9  0.4236277 0.5074421 0.5859472 0.5601687 0.6228626 0.6480921    0
## 10 0.3576471 0.5255698 0.5527057 0.5497490 0.6204819 0.6717371    0
## 11 0.3794326 0.5235007 0.5783191 0.5600467 0.6126720 0.6717371    0
## 12 0.4460432 0.5480930 0.5999072 0.5808134 0.6296780 0.6717371    0
## 13 0.4014252 0.5725752 0.6087279 0.5875305 0.6576219 0.6678832    0
## 14 0.4460432 0.5585005 0.6117973 0.5911995 0.6590982 0.6717371    0
## 15 0.4014252 0.5689401 0.6117973 0.5867010 0.6507194 0.6955990    0

The last values of maxnodes have the highest accuracy. You can try with higher values to see if you can get a higher score.

store_maxnode <- list()
tuneGrid <- expand.grid(.mtry = best_mtry)
for (maxnodes in c(20: 30)) {
    set.seed(1234)
    rf_maxnode <- train(survived~.,
        data = data_train,
        method = "rf",
        metric = "Accuracy",
        tuneGrid = tuneGrid,
        trControl = trControl,
        importance = TRUE,
        nodesize = 14,
        maxnodes = maxnodes,
        ntree = 300)
    key <- toString(maxnodes)
    store_maxnode[[key]] <- rf_maxnode
}
results_node <- resamples(store_maxnode)
summary(results_node)

Output:

## 
## Call:
## summary.resamples(object = results_node)
## 
## Models: 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30 
## Number of resamples: 10 
## 
## Accuracy 
##         Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## 20 0.7142857 0.7821644 0.8144005 0.8075301 0.8447719 0.8571429    0
## 21 0.7142857 0.8000215 0.8144005 0.8075014 0.8403614 0.8571429    0
## 22 0.7023810 0.7941050 0.8263769 0.8099254 0.8328313 0.8690476    0
## 23 0.7023810 0.7941050 0.8263769 0.8111302 0.8447719 0.8571429    0
## 24 0.7142857 0.7946429 0.8313253 0.8135112 0.8417599 0.8690476    0
## 25 0.7142857 0.7916667 0.8313253 0.8099398 0.8408635 0.8690476    0
## 26 0.7142857 0.7941050 0.8203528 0.8123207 0.8528758 0.8571429    0
## 27 0.7023810 0.8060456 0.8313253 0.8135112 0.8333333 0.8690476    0
## 28 0.7261905 0.7941050 0.8203528 0.8111015 0.8328313 0.8690476    0
## 29 0.7142857 0.7910929 0.8313253 0.8087063 0.8333333 0.8571429    0
## 30 0.6785714 0.7910929 0.8263769 0.8063253 0.8403614 0.8690476    0
## 
## Kappa 
##         Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## 20 0.3956835 0.5316120 0.5961830 0.5854366 0.6661120 0.6955990    0
## 21 0.3956835 0.5699332 0.5960343 0.5853247 0.6590982 0.6919315    0
## 22 0.3735084 0.5560661 0.6221836 0.5914492 0.6422128 0.7189781    0
## 23 0.3735084 0.5594228 0.6228827 0.5939786 0.6657372 0.6955990    0
## 24 0.3956835 0.5600352 0.6337821 0.5992188 0.6604703 0.7189781    0
## 25 0.3956835 0.5530760 0.6354875 0.5912239 0.6554912 0.7189781    0
## 26 0.3956835 0.5589331 0.6136074 0.5969142 0.6822128 0.6955990    0
## 27 0.3735084 0.5852459 0.6368425 0.5998148 0.6426088 0.7189781    0
## 28 0.4290780 0.5589331 0.6154905 0.5946859 0.6356141 0.7189781    0
## 29 0.4070588 0.5534173 0.6337821 0.5901173 0.6423101 0.6919315    0
## 30 0.3297872 0.5534173 0.6202632 0.5843432 0.6590982 0.7189781    0

The highest accuracy score is obtained with a value of maxnodes equal to 24.

Step 4) Search the best ntrees

Now that you have the best values of mtry and maxnodes, you can tune the number of trees. The method is exactly the same as for maxnodes.

store_maxtrees <- list()
for (ntree in c(250, 300, 350, 400, 450, 500, 550, 600, 800, 1000, 2000)) {
    set.seed(5678)
    rf_maxtrees <- train(survived~.,
        data = data_train,
        method = "rf",
        metric = "Accuracy",
        tuneGrid = tuneGrid,
        trControl = trControl,
        importance = TRUE,
        nodesize = 14,
        maxnodes = 24,
        ntree = ntree)
    key <- toString(ntree)
    store_maxtrees[[key]] <- rf_maxtrees
}
results_tree <- resamples(store_maxtrees)
summary(results_tree)

Output:

## 
## Call:
## summary.resamples(object = results_tree)
## 
## Models: 250, 300, 350, 400, 450, 500, 550, 600, 800, 1000, 2000 
## Number of resamples: 10 
## 
## Accuracy 
##           Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## 250  0.7380952 0.7976190 0.8083764 0.8087010 0.8292683 0.8674699    0
## 300  0.7500000 0.7886905 0.8024240 0.8027199 0.8203397 0.8452381    0
## 350  0.7500000 0.7886905 0.8024240 0.8027056 0.8277623 0.8452381    0
## 400  0.7500000 0.7886905 0.8083764 0.8051009 0.8292683 0.8452381    0
## 450  0.7500000 0.7886905 0.8024240 0.8039104 0.8292683 0.8452381    0
## 500  0.7619048 0.7886905 0.8024240 0.8062914 0.8292683 0.8571429    0
## 550  0.7619048 0.7886905 0.8083764 0.8099062 0.8323171 0.8571429    0
## 600  0.7619048 0.7886905 0.8083764 0.8099205 0.8323171 0.8674699    0
## 800  0.7619048 0.7976190 0.8083764 0.8110820 0.8292683 0.8674699    0
## 1000 0.7619048 0.7976190 0.8121510 0.8086723 0.8303571 0.8452381    0
## 2000 0.7619048 0.7886905 0.8121510 0.8086723 0.8333333 0.8452381    0
## 
## Kappa 
##           Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## 250  0.4061697 0.5667400 0.5836013 0.5856103 0.6335363 0.7196807    0
## 300  0.4302326 0.5449376 0.5780349 0.5723307 0.6130767 0.6710843    0
## 350  0.4302326 0.5449376 0.5780349 0.5723185 0.6291592 0.6710843    0
## 400  0.4302326 0.5482030 0.5836013 0.5774782 0.6335363 0.6710843    0
## 450  0.4302326 0.5449376 0.5780349 0.5750587 0.6335363 0.6710843    0
## 500  0.4601542 0.5449376 0.5780349 0.5804340 0.6335363 0.6949153    0
## 550  0.4601542 0.5482030 0.5857118 0.5884507 0.6396872 0.6949153    0
## 600  0.4601542 0.5482030 0.5857118 0.5884374 0.6396872 0.7196807    0
## 800  0.4601542 0.5667400 0.5836013 0.5910088 0.6335363 0.7196807    0
## 1000 0.4601542 0.5667400 0.5961590 0.5857446 0.6343666 0.6678832    0
## 2000 0.4601542 0.5482030 0.5961590 0.5862151 0.6440678 0.6656337    0

You have your final model. You can train the random forest with the following parameters:

  • ntree = 800: 800 trees will be trained
  • mtry = 4: 4 features are chosen for each iteration
  • maxnodes = 24: a maximum of 24 terminal nodes (leaves)

fit_rf <- train(survived~.,
    data_train,
    method = "rf",
    metric = "Accuracy",
    tuneGrid = tuneGrid,
    trControl = trControl,
    importance = TRUE,
    nodesize = 14,
    ntree = 800,
    maxnodes = 24)

Step 5) Evaluate the model

Biblioteฤna oznaka ima funkciju predviฤ‘anja.

predict(model, newdata = df)
Arguments
- `model`: Define the model evaluated before
- `newdata`: Define the dataset on which to make the prediction

prediction <- predict(fit_rf, data_test)

You can use the prediction to compute the confusion matrix and see the accuracy score:

confusionMatrix(prediction, data_test$survived)

Output:

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  No Yes
##        No  110  32
##        Yes  11  56
##                                          
##                Accuracy : 0.7943         
##                  95% CI : (0.733, 0.8469)
##     No Information Rate : 0.5789         
##     P-Value [Acc > NIR] : 3.959e-11      
##                                          
##                   Kappa : 0.5638         
##  Mcnemar's Test P-Value : 0.002289       
##                                          
##             Sensitivity : 0.9091         
##             Specificity : 0.6364         
##          Pos Pred Value : 0.7746         
##          Neg Pred Value : 0.8358         
##              Prevalence : 0.5789         
##          Detection Rate : 0.5263         
##    Detection Prevalence : 0.6794         
##       Balanced Accuracy : 0.7727         
##                                          
##        'Positive' Class : No             
## 

Imate toฤnost od 0.7943 posto, ลกto je viลกe od zadane vrijednosti

Step 6) Visualize the result

Lastly, you can look at the feature importance with the function varImp(). It seems that the most important features are the sex and the age. That is not surprising: important features are likely to appear closer to the root of a tree, while less important features will often appear close to the leaves.

varImpPlot(fit_rf)

Output:

varImp(fit_rf)
## rf variable importance
## 
##              Importance
## sexmale         100.000
## age              28.014
## pclassMiddle     27.016
## fare             21.557
## pclassUpper      16.324
## sibsp            11.246
## parch             5.522
## embarkedC         4.908
## embarkedQ         1.420
## embarkedS         0.000		

Summary

We can summarize how to train and evaluate a random forest with the table below:

Library       Objective                         Function           Parameters
randomForest  Create a random forest            randomForest()     formula, ntree=n, mtry=FALSE, maxnodes = NULL
caret         Create a K-fold cross-validation  trainControl()     method = "cv", number = n, search = "grid"
caret         Train a random forest             train()            formula, df, method = "rf", metric = "Accuracy", trControl = trainControl(), tuneGrid = NULL
caret         Predict out of sample             predict()          model, newdata = df
caret         Confusion matrix and statistics   confusionMatrix()  model, y test
caret         Variable importance               varImp()           model

Appendix

List of the models that can be trained with caret

names(getModelInfo())

Output:

##   [1] "ada"                 "AdaBag"              "AdaBoost.M1"
##   [4] "adaboost"            "amdai"               "ANFIS"
##   [7] "avNNet"              "awnb"                "awtan"
##  [10] "bag"                 "bagEarth"            "bagEarthGCV"
##  [13] "bagFDA"              "bagFDAGCV"           "bam"
##  [16] "bartMachine"         "bayesglm"            "binda"
##  [19] "blackboost"          "blasso"              "blassoAveraged"
##  [22] "bridge"              "brnn"                "BstLm"
##  [25] "bstSm"               "bstTree"             "C5.0"
##  [28] "C5.0Cost"            "C5.0Rules"           "C5.0Tree"
##  [31] "cforest"             "chaid"               "CSimca"
##  [34] "ctree"               "ctree2"              "cubist"
##  [37] "dda"                 "deepboost"           "DENFIS"
##  [40] "dnn"                 "dwdLinear"           "dwdPoly"
##  [43] "dwdRadial"           "earth"               "elm"
##  [46] "enet"                "evtree"              "extraTrees"
##  [49] "fda"                 "FH.GBML"             "FIR.DM"
##  [52] "foba"                "FRBCS.CHI"           "FRBCS.W"
##  [55] "FS.HGD"              "gam"                 "gamboost"
##  [58] "gamLoess"            "gamSpline"           "gaussprLinear"
##  [61] "gaussprPoly"         "gaussprRadial"       "gbm_h2o"
##  [64] "gbm"                 "gcvEarth"            "GFS.FR.MOGUL"
##  [67] "GFS.GCCL"            "GFS.LT.RS"           "GFS.THRIFT"
##  [70] "glm.nb"              "glm"                 "glmboost"
##  [73] "glmnet_h2o"          "glmnet"              "glmStepAIC"
##  [76] "gpls"                "hda"                 "hdda"
##  [79] "hdrda"               "HYFIS"               "icr"
##  [82] "J48"                 "JRip"                "kernelpls"
##  [85] "kknn"                "knn"                 "krlsPoly"
##  [88] "krlsRadial"          "lars"                "lars2"
##  [91] "lasso"               "lda"                 "lda2"
##  [94] "leapBackward"        "leapForward"         "leapSeq"
##  [97] "Linda"               "lm"                  "lmStepAIC"
## [100] "LMT"                 "loclda"              "logicBag"
## [103] "LogitBoost"          "logreg"              "lssvmLinear"
## [106] "lssvmPoly"           "lssvmRadial"         "lvq"
## [109] "M5"                  "M5Rules"             "manb"
## [112] "mda"                 "Mlda"                "mlp"
## [115] "mlpKerasDecay"       "mlpKerasDecayCost"   "mlpKerasDropout"
## [118] "mlpKerasDropoutCost" "mlpML"               "mlpSGD"
## [121] "mlpWeightDecay"      "mlpWeightDecayML"    "monmlp"
## [124] "msaenet"             "multinom"            "mxnet"
## [127] "mxnetAdam"           "naive_bayes"         "nb"
## [130] "nbDiscrete"          "nbSearch"            "neuralnet"
## [133] "nnet"                "nnls"                "nodeHarvest"
## [136] "null"                "OneR"                "ordinalNet"
## [139] "ORFlog"              "ORFpls"              "ORFridge"
## [142] "ORFsvm"              "ownn"                "pam"
## [145] "parRF"               "PART"                "partDSA"
## [148] "pcaNNet"             "pcr"                 "pda"
## [151] "pda2"                "penalized"           "PenalizedLDA"
## [154] "plr"                 "pls"                 "plsRglm"
## [157] "polr"                "ppr"                 "PRIM"
## [160] "protoclass"          "pythonKnnReg"        "qda"
## [163] "QdaCov"              "qrf"                 "qrnn"
## [166] "randomGLM"           "ranger"              "rbf"
## [169] "rbfDDA"              "Rborist"             "rda"
## [172] "regLogistic"         "relaxo"              "rf"
## [175] "rFerns"              "RFlda"               "rfRules"
## [178] "ridge"               "rlda"                "rlm"
## [181] "rmda"                "rocc"                "rotationForest"
## [184] "rotationForestCp"    "rpart"               "rpart1SE"
## [187] "rpart2"              "rpartCost"           "rpartScore"
## [190] "rqlasso"             "rqnc"                "RRF"
## [193] "RRFglobal"           "rrlda"               "RSimca"
## [196] "rvmLinear"           "rvmPoly"             "rvmRadial"
## [199] "SBC"                 "sda"                 "sdwd"
## [202] "simpls"              "SLAVE"               "slda"
## [205] "smda"                "snn"                 "sparseLDA"
## [208] "spikeslab"           "spls"                "stepLDA"
## [211] "stepQDA"             "superpc"             "svmBoundrangeString"
## [214] "svmExpoString"       "svmLinear"           "svmLinear2"
## [217] "svmLinear3"          "svmLinearWeights"    "svmLinearWeights2"
## [220] "svmPoly"             "svmRadial"           "svmRadialCost"
## [223] "svmRadialSigma"      "svmRadialWeights"    "svmSpectrumString"
## [226] "tan"                 "tanSearch"           "treebag"
## [229] "vbmpRadial"          "vglmAdjCat"          "vglmContRatio"
## [232] "vglmCumulative"      "widekernelpls"       "WM"
## [235] "wsrf"                "xgbLinear"           "xgbTree"
## [238] "xyf"
