List of models to learn in Machine Learning
1. Linear Regression
2. Ridge Regression
3. K-Neighbors
4. SVM
5. Logistic Regression
6. Lasso
7. Random Forest
8. XGBoost
Topics to cover
1. ROC AUC curve (ROC vs. F1 -> give more priority to ROC)
2. All the metrics
3. Davies–Bouldin – clustering in sklearn – davies_bouldin_score
4. Clustering metrics
a. Davies–Bouldin (DBI compares cluster centroids, which is cheap pointwise work, whereas silhouette compares all pairs of points and takes more time. DBI favours convex clusters and tends to come out higher (worse) for density-based clusters, such as those from DBSCAN. A lower DBI is desirable.)
i. DBI formula (see the sketch after this list)
b. Silhouette
c. Elbow
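A sketch of the DBI formula referenced above (the standard definition, stated here since the notes leave it blank): for k clusters, let s_i be the average distance of points in cluster i to its centroid, and d_ij the distance between centroids i and j. Then

DBI = \frac{1}{k} \sum_{i=1}^{k} \max_{j \neq i} \frac{s_i + s_j}{d_{ij}}

A lower value means tighter, better-separated clusters.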
5. LIME and SHAP?
6. Hyperopt
7. Agglomerative
8. K-Means
9. NLP
a. LLMs do text generation
b. WSD – Word Sense Disambiguation (e.g. river bank vs. corporate bank)
c. Basic Structure
i. Featurising (feature extraction)
1. Each word becomes 1 token
2. Retain whatever is going to make the most sense
3. Stop word removal
4. Keep the important words which make the most sense
5. Stemming and lemmatization
a. Stemming (crude suffix stripping)
i. Studying -> "studi"
ii. Studies -> "studi"
b. Lemmatization
i. Lemma – the root/dictionary form of a word
ii. As a process it is more complicated
iii. Where stemming fails, lemmatization is the better approach: it follows grammar rules and maps words to dictionary lemmas instead of just stripping suffixes (see the sketch below).
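A minimal sketch of stemming vs. lemmatization on the examples above, assuming NLTK is installed and its wordnet data has been downloaded (the code is illustrative, not from these notes):

from nltk.stem import PorterStemmer, WordNetLemmatizer  # assumes: pip install nltk, nltk.download('wordnet')

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
for word in ["studying", "studies"]:
    print(word, "->", stemmer.stem(word))                   # crude suffix stripping: both give "studi"
    print(word, "->", lemmatizer.lemmatize(word, pos="v"))  # dictionary lemma: both give "study"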
ii. Vectorising
1. It means mapping text to numeric vectors
2. Stop word removal
3. Word count frequency -> hits
4. The vocabulary is stored in hash maps, which makes lookups very fast
5. Lowercasing => normalise case (compare ASCII values)
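A minimal vectorising sketch with scikit-learn's CountVectorizer; the two-document corpus below is made up for illustration:

from sklearn.feature_extraction.text import CountVectorizer

corpus = ["the bank of the river", "the bank approved the loan"]  # toy corpus
vectorizer = CountVectorizer(lowercase=True, stop_words="english")
X = vectorizer.fit_transform(corpus)  # sparse matrix of word-count frequencies ("hits")
print(vectorizer.vocabulary_)         # word -> column index, stored in a hash map (dict)
print(X.toarray())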
iii. Evaluation metrics
1. Confusion matrix
2. Accuracy
3. KL divergence – for generation tasks
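A quick sketch of these metrics in scikit-learn/SciPy; the labels and distributions below are made up:

from sklearn.metrics import confusion_matrix, accuracy_score
from scipy.stats import entropy

y_true = [0, 1, 1, 0, 1]  # toy ground-truth labels
y_pred = [0, 1, 0, 0, 1]  # toy predictions
print(confusion_matrix(y_true, y_pred))  # rows = true class, columns = predicted class
print(accuracy_score(y_true, y_pred))    # fraction of correct predictions: 0.8 here
print(entropy([0.5, 0.5], [0.9, 0.1]))   # KL divergence between two toy distributions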
iv. 3 major task areas
1. Syntax Based (Tokenization, Stemming)
2. Deterministic Task – POS Tagging/Simple text classification
3. Generation Based
v. TOC for NLP (not sure if it is important)
vi. Deterministic involves no randomness
vii. Fitting
1. The fitted output is a sparse matrix
2. Type 1 and Type 2 errors (Tiwan)
3. TF-IDF score (the higher the score, the more important the word is)
viii. TF (Term Frequency)
1. TF("Samyukta", article 1) = (count of "Samyukta" in article 1) / (number of words in article 1)
ix. IDF (Inverse Document Frequency)
1. IDF("Samyukta") = log( (number of documents in the corpus) / (number of documents where the word "Samyukta" appears) )
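A plain-Python sketch of the two formulas above; "Samyukta" is the notes' example word and all counts are assumed for illustration:

import math

count_in_article = 3     # assumed: "Samyukta" appears 3 times in article 1
words_in_article = 100   # assumed: article 1 has 100 words in total
tf = count_in_article / words_in_article         # TF = 0.03

docs_in_corpus = 1000    # assumed corpus size
docs_with_word = 10      # assumed: "Samyukta" appears in 10 documents
idf = math.log(docs_in_corpus / docs_with_word)  # IDF = log(100) ~ 4.6

print(tf * idf)  # TF-IDF: the higher the score, the more important the word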
x. Transformers – 3 model families (encoder-only, decoder-only, encoder–decoder)
10. Study the official Parameter Grid
11. DBSCAN
Learn how to use GridSearchCV
Learn how to use RandomizedSearchCV
Learn all CV methods.
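A minimal GridSearchCV sketch on a toy ridge regression; RandomizedSearchCV takes the same arguments but samples parameter combinations at random (values below are illustrative):

from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=200, n_features=5, random_state=0)  # toy data
param_grid = {"alpha": [0.1, 1.0, 10.0]}          # hyperparameter grid to search
search = GridSearchCV(Ridge(), param_grid, cv=5)  # 5-fold cross-validation per combination
search.fit(X, y)
print(search.best_params_, search.best_score_)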
Copy notes from all the notebooks
DBSCAN
DBSCAN is a density-based clustering algorithm used for unsupervised learning problems. It was designed to eliminate the problems K-Means has with nested (non-convex) data and high-dimensional data.
It has 3 important terms and 2 important hyperparameters (see the sketch after this list).
1. Terms
a. Core point:
A point that has at least minPts datapoints within its epsilon-neighbourhood; the points in its area can extend the cluster.
b. Non-core (border) point:
A point that does not have minPts datapoints within its epsilon-neighbourhood; it can belong to a cluster but cannot extend it.
c. Outliers/noise:
Datapoints that are not part of any cluster (not reachable from any core point).
2. Hyperparameters
a. minPts:
The minimum number of datapoints that must lie within a point's epsilon-neighbourhood for it to be considered a core point.
b. Epsilon:
The radius of the neighbourhood around a point.
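A minimal scikit-learn DBSCAN sketch tying the hyperparameters above to code; the eps and min_samples values are illustrative:

from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=200, noise=0.05, random_state=0)  # nested, non-convex toy data
model = DBSCAN(eps=0.3, min_samples=5).fit(X)  # eps = epsilon radius, min_samples = minPts
print(model.labels_)  # cluster index per point; -1 marks outliers/noise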