
AOML Notes

The document outlines key machine learning models to learn, including Linear Regression, SVM, and Random Forest, as well as essential topics such as ROC AUC Curve, clustering metrics, and NLP techniques. It provides detailed explanations of clustering algorithms like DB Scan, including important terms and hyperparameters. Additionally, it emphasizes the importance of evaluation metrics and methods like GridSearchCV and Random Search CV for model optimization.

Uploaded by

Samyukta G

List of models to learn in Machine Learning

1. Linear Regression
2. Ridge Regression
3. K-Neighbors
4. SVM
5. Logistic Regression
6. Lasso
7. Random Forest
8. XGBoost

Topics to cover
1. ROC AUC Curve (ROC v/s F1 -> give more priority to ROC)
2. All the metrics
3. Davies-Bouldin – clustering – sklearn's davies_bouldin_score
4. Clustering metrics
a. Davies-Bouldin (DBI does a pointwise comparison and is cheaper to compute,
whereas silhouette takes more time. DBI assumes convex clusters, so it can be
misleadingly high for non-convex clusters such as those produced by DBSCAN.
A lower DBI is desirable.)
i. DBI Formula
b. Silhouette
c. Elbow
5. LIME and SHAP?
6. Hyperopt
7. Agglomerative
8. K-Means
9. NLP
a. LLMs do text generation
b. WSD – Word Sense Disambiguation (river bank v/s corporate bank)
c. Basic Structure
i. Featurising (feature extraction)
1. Each word becomes 1 token
2. Retain what is going to make the most sense
3. Stop word removal
4. Keep important words which make the most sense
5. Stemming and Lemmatization
a. Stemming
i. Studying
ii. Studies
b. Lemmatization
i. Lemma – the root form
ii. As a process it is more complicated
iii. Used when stemming fails; it is a more linguistically informed,
rule-based approach that follows grammar rules.
ii. Vectorising
1. It means mapping words to numbers
2. Stop word removal
3. Word count frequency -> hits
4. The vocabulary is stored in hashmaps, which makes lookups very fast
5. Lowercasing normalises the ASCII values
iii. Evaluation Metrics
1. Confusion matrix
2. Accuracy
3. KL – Divergence – Generation
iv. 3 Major Sectors
1. Syntax Based (Tokenization, Stemming)
2. Deterministic Task – POS Tagging/Simple text classification
3. Generation Based
v. TOC for NLP (idk if it is imp)
vi. Deterministic involves no randomness
vii. Fitting
1. It is now in a sparse matrix
2. Type 1 and Type 2 errors (Tiwan)
3. TF-IDF score (the higher the score, the more important the
word is)
viii. TF (Term Frequency)
1. TF(word, article1) = (count of "Samyukta" in article1) / (total number of words in article1)
ix. IDF (Inverse Document Frequency)
1. IDF(word) = log(number of documents in the corpus / number of documents where the word "Samyukta" has appeared)
x. Transformers – 3 model families (encoder-only, decoder-only, encoder-decoder)
10. Study the official Parameter Grid
11. DBSCAN
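The TF and IDF definitions above can be sanity-checked with scikit-learn's TfidfVectorizer. A minimal sketch, assuming a made-up three-document corpus; note that sklearn's default IDF is smoothed (log((1+n)/(1+df)) + 1), so the exact numbers differ slightly from the raw formula, but the ordering is the same: rarer words get larger IDF weights.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Tiny illustrative corpus: the word "samyukta" appears in only one
# document, so it should receive a higher IDF weight than "the",
# which appears in every document.
corpus = [
    "samyukta wrote the article about clustering",
    "the article covers clustering metrics",
    "the notes cover nlp and transformers",
]

vectorizer = TfidfVectorizer()        # defaults: lowercasing, word tokenization
X = vectorizer.fit_transform(corpus)  # sparse matrix: rows = documents, cols = vocabulary

vocab = vectorizer.vocabulary_        # maps each term to its column index
idf = vectorizer.idf_
# Rare word gets a larger IDF than a word present in every document.
print(idf[vocab["samyukta"]] > idf[vocab["the"]])   # True
```

As the notes say under "Fitting", the result is stored as a sparse matrix; each row holds the TF-IDF scores of one document.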

Learn how to use GridSearchCV


Learn how to use RandomizedSearchCV

Learn all CV methods.
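A minimal sketch of both search utilities on a toy dataset. The parameter grid here is illustrative only, not the "official" one the notes refer to; GridSearchCV tries every combination, while RandomizedSearchCV samples a fixed number of them.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# Small synthetic classification problem for demonstration.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Illustrative grid -- swap in the official parameter grid when studying.
param_grid = {"n_estimators": [50, 100], "max_depth": [3, None]}

# GridSearchCV: exhaustively fits every combination (2 x 2 = 4 per fold).
grid = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=3)
grid.fit(X, y)
print("grid best:", grid.best_params_)

# RandomizedSearchCV: samples n_iter combinations -- cheaper on large grids.
rand = RandomizedSearchCV(RandomForestClassifier(random_state=0), param_grid,
                          n_iter=3, cv=3, random_state=0)
rand.fit(X, y)
print("random best:", rand.best_params_)
```

Both objects expose `best_params_`, `best_score_`, and `cv_results_` after fitting, which is what makes them convenient for comparing CV methods.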

Copy notes from all the notebooks


DBSCAN
DBSCAN is a density-based clustering algorithm used for unsupervised learning
problems.

It was designed to address the problems K-Means clustering has with nested
(non-convex) data and high-dimensional data.

It has 3 important terms and 2 important hyperparameters.


1. Terms
a. Core point:
A point that has at least minPts datapoints within its epsilon-neighborhood;
the points in that neighborhood can extend the cluster.
b. Non-core (border) point:
A point that lies within a core point's neighborhood but does not itself have
minPts datapoints within its own neighborhood; it belongs to a cluster but
cannot extend it.
c. Outliers/Noise:
Datapoints that do not fall within any core point's neighborhood and are not
part of any cluster.
2. Hyperparameters
a. minPts:
The minimum number of datapoints that must lie within a point's neighborhood
for that point to be considered a core point.
b. Epsilon:
The radius of the neighborhood around a point.
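These terms map directly onto scikit-learn's DBSCAN, where `eps` is epsilon and `min_samples` is minPts. A sketch on synthetic two-moons data (the dataset and parameter values are illustrative assumptions), which is exactly the nested, non-convex case where K-Means struggles:

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
from sklearn.metrics import davies_bouldin_score

# Nested, non-convex clusters.
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

# eps = epsilon (neighborhood radius), min_samples = minPts.
db = DBSCAN(eps=0.2, min_samples=5).fit(X)

labels = db.labels_                        # -1 marks outliers/noise
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_core = len(db.core_sample_indices_)      # how many core points were found
n_noise = list(labels).count(-1)
print(n_clusters, "clusters,", n_core, "core points,", n_noise, "noise points")

# Davies-Bouldin index on the clustered (non-noise) points: lower is better,
# but remember it assumes convex clusters, so it can look poor here.
if n_clusters > 1:
    mask = labels != -1
    print("DBI:", davies_bouldin_score(X[mask], labels[mask]))
```

Note that sklearn exposes core points via `core_sample_indices_`, and noise points are labelled -1 rather than being forced into a cluster, unlike K-Means.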
