BASIC MACHINE LEARNING TERMS
Here are some basic terms that every machine learning engineer should be
familiar with:
1. **Algorithm**: A set of rules or procedures that a machine follows in order to
learn patterns from data.
2. **Training Data**: The dataset used to train the machine learning model.
3. **Test Data**: The dataset used to evaluate the accuracy and performance of the
trained model.
4. **Feature**: An individual measurable property or characteristic used as input
in a model. For example, height and weight can be features in a health-related
model.
5. **Label**: The output variable in supervised learning, which the model is
trained to predict.
6. **Supervised Learning**: A type of machine learning where the model is trained
on labeled data.
7. **Unsupervised Learning**: A type of machine learning where the model is trained
on unlabeled data and tries to find hidden patterns.
8. **Reinforcement Learning**: A type of learning where an agent learns to make
decisions by taking actions in an environment to maximize cumulative reward.
9. **Overfitting**: When a model performs well on training data but poorly on test
data, indicating it has learned the noise in the training data instead of the
underlying pattern.
10. **Underfitting**: When a model performs poorly on both training and test data,
indicating it has not captured the underlying pattern in the data.
11. **Regularization**: Techniques used to prevent overfitting by adding a penalty
for more complex models.
12. **Gradient Descent**: An optimization algorithm used to minimize the loss
function by iteratively adjusting the model parameters in the direction of the
negative gradient (see the sketch after this list).
13. **Hyperparameters**: Parameters whose values are set before the learning
process begins, such as learning rate and number of trees in a random forest.
14. **Confusion Matrix**: A table used to evaluate the performance of a
classification algorithm, showing the true positives, false positives, true
negatives, and false negatives (a short example computing these metrics follows
this list).
15. **ROC Curve**: Receiver Operating Characteristic curve, a plot of a
classifier's true positive rate against its false positive rate across
classification thresholds.
16. **Precision**: The ratio of true positive predictions to the total predicted
positives.
17. **Recall**: The ratio of true positive predictions to the total actual
positives.
18. **F1 Score**: The harmonic mean of precision and recall, used as a single
metric to balance them.
19. **AUC (Area Under the Curve)**: A performance metric for classification
problems, representing the area under the ROC curve.
20. **Neural Network**: A model built from layers of interconnected nodes
(neurons) that learns to recognize relationships in data, loosely inspired by the
structure of the human brain.
21. **Activation Function**: A function used in neural networks to introduce non-
linearity, such as ReLU, sigmoid, or tanh.
22. **Epoch**: One complete pass through the entire training dataset.
23. **Batch Size**: The number of training examples utilized in one iteration.
24. **Learning Rate**: A hyperparameter that controls how much to change the model
in response to the estimated error each time the model weights are updated.
25. **Cross-Validation**: A technique to evaluate the performance of a machine
learning model by repeatedly partitioning the data into training and validation
subsets (e.g., k folds) and averaging the results (see the example after this
list).
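To make gradient descent (item 12) concrete, here is a minimal sketch of batch
gradient descent for simple linear regression. The NumPy implementation, the
learning rate, and the synthetic data are illustrative assumptions rather than
anything prescribed by the list.

```python
import numpy as np

# Minimal batch gradient descent for simple linear regression (illustrative sketch).
# Model: y_hat = w * x + b; loss: mean squared error.
def gradient_descent(x, y, learning_rate=0.1, epochs=2000):
    w, b = 0.0, 0.0
    n = len(x)
    for _ in range(epochs):
        error = (w * x + b) - y
        grad_w = (2.0 / n) * np.dot(error, x)   # dLoss/dw
        grad_b = (2.0 / n) * error.sum()        # dLoss/db
        w -= learning_rate * grad_w             # step against the gradient
        b -= learning_rate * grad_b
    return w, b

# Synthetic data: y is roughly 3x + 1 plus a little noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 200)
y = 3 * x + 1 + rng.normal(0, 0.1, 200)
print(gradient_descent(x, y))  # should be close to (3.0, 1.0)
```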
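The classification metrics above (items 14-19) are easiest to remember side by
side. Below is a small scikit-learn snippet that computes them on made-up labels;
the numbers have no significance beyond illustration.

```python
from sklearn.metrics import (confusion_matrix, precision_score, recall_score,
                             f1_score, roc_auc_score)

# Toy ground-truth labels, hard predictions, and predicted probabilities.
y_true  = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred  = [0, 1, 1, 1, 0, 0, 1, 0]
y_score = [0.1, 0.6, 0.8, 0.9, 0.4, 0.2, 0.7, 0.3]

print(confusion_matrix(y_true, y_pred))                # rows = actual, cols = predicted
print("precision:", precision_score(y_true, y_pred))   # TP / (TP + FP)
print("recall:   ", recall_score(y_true, y_pred))      # TP / (TP + FN)
print("f1:       ", f1_score(y_true, y_pred))          # harmonic mean of the two
print("roc auc:  ", roc_auc_score(y_true, y_score))    # area under the ROC curve
```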
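Cross-validation (item 25) is usually a one-liner in practice. The sketch below
assumes scikit-learn and its bundled iris dataset purely for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# 5-fold cross-validation: train on 4 folds, evaluate on the held-out fold,
# and repeat so every fold serves as the validation set exactly once.
X, y = load_iris(return_X_y=True)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores, scores.mean())
```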
Here are more essential machine learning terms every engineer should know:
26. **Feature Engineering**: The process of selecting, modifying, and creating
features from raw data to improve the performance of a machine learning model.
27. **Feature Scaling**: The process of normalizing or standardizing features so
that those with large numeric ranges do not dominate the model. Common methods
include min-max scaling and z-score normalization (a scaling-plus-PCA sketch
follows this list).
28. **Dimensionality Reduction**: Techniques used to reduce the number of features
in a dataset while retaining important information. Principal Component Analysis
(PCA) and t-Distributed Stochastic Neighbor Embedding (t-SNE) are popular methods.
29. **Bias-Variance Tradeoff**: The balance between bias (error from overly
simplistic assumptions) and variance (error from sensitivity to fluctuations in
the training data). Achieving the right tradeoff is crucial for model
performance.
30. **Ensemble Learning**: Combining multiple models to improve performance. Common
ensemble methods include bagging, boosting, and stacking.
31. **Bagging (Bootstrap Aggregating)**: An ensemble method that trains multiple
models on different subsets of the data and averages their predictions to reduce
variance.
32. **Boosting**: An ensemble method that trains models sequentially, with each new
model focusing on the errors made by previous models. Examples include AdaBoost,
Gradient Boosting, and XGBoost.
33. **Stacking**: An ensemble method that combines the predictions of multiple
models using a meta-model, which learns to make the final prediction.
34. **Decision Tree**: A model that uses a tree-like structure to make decisions
based on feature values. Each internal node represents a feature, each branch
represents a decision rule, and each leaf node represents an outcome.
35. **Random Forest**: An ensemble method that combines multiple decision trees to
improve performance and reduce overfitting.
36. **Support Vector Machine (SVM)**: A supervised learning algorithm used for
classification and regression tasks. It finds the hyperplane that best separates
the classes in the feature space.
37. **K-Nearest Neighbors (KNN)**: A simple, instance-based learning algorithm that
classifies new data points based on the majority class of their k-nearest
neighbors.
38. **Naive Bayes**: A probabilistic classification algorithm based on Bayes'
theorem, assuming independence between features.
39. **Clustering**: An unsupervised learning technique that groups similar data
points together. K-means, hierarchical clustering, and DBSCAN are common clustering
algorithms.
40. **Principal Component Analysis (PCA)**: A dimensionality reduction technique
that transforms features into a set of linearly uncorrelated components, ordered by
the amount of variance they explain.
41. **t-SNE (t-Distributed Stochastic Neighbor Embedding)**: A dimensionality
reduction technique used for visualizing high-dimensional data by projecting it
into lower-dimensional space.
42. **Recurrent Neural Network (RNN)**: A type of neural network designed for
sequential data, where connections between nodes form a directed graph along a
sequence. Commonly used in natural language processing and time series analysis.
43. **Convolutional Neural Network (CNN)**: A type of neural network designed for
processing grid-like data, such as images. It uses convolutional layers to
automatically and adaptively learn spatial hierarchies of features.
44. **Transfer Learning**: A technique where a pre-trained model is adapted for a
new, but related task, allowing for faster training and improved performance with
less data.
45. **AutoML**: Automated Machine Learning, which aims to automate the end-to-end
process of applying machine learning to real-world problems.
46. **Generative Adversarial Network (GAN)**: A type of neural network consisting
of two models (generator and discriminator) that are trained simultaneously, with
the generator creating realistic data and the discriminator distinguishing between
real and generated data.
47. **Long Short-Term Memory (LSTM)**: A type of RNN designed to capture long-term
dependencies in sequential data by using memory cells and gating mechanisms to
control the flow of information.
48. **Gradient Boosting Machine (GBM)**: An ensemble learning method that builds
additive models in a forward stage-wise manner, optimizing for a differentiable
loss function. Examples include XGBoost and LightGBM.
49. **Bayesian Optimization**: A method for optimizing hyperparameters by building
a probabilistic model of the objective function and iteratively selecting the most
promising hyperparameters to evaluate.
50. **Hyperparameter Tuning**: The process of searching for the hyperparameter
values that give the best model performance. Common techniques include grid
search, random search, and Bayesian optimization (a grid-search sketch follows
this list).
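To see feature scaling (item 27) and PCA (item 40) working together, here is a
minimal scikit-learn pipeline sketch; the wine dataset and the choice of two
components are illustrative assumptions.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Standardize each feature (z-score), then project onto the two principal
# components that explain the most variance.
X, _ = load_wine(return_X_y=True)
pipeline = make_pipeline(StandardScaler(), PCA(n_components=2))
X_reduced = pipeline.fit_transform(X)
print(X_reduced.shape)                                         # (178, 2)
print(pipeline.named_steps["pca"].explained_variance_ratio_)   # variance per component
```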
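Finally, hyperparameter tuning (item 50) often boils down to a search over
candidate values, scored with cross-validation. A minimal grid-search sketch for
a random forest, assuming scikit-learn and its bundled breast-cancer dataset:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Try every combination of the candidate hyperparameters below and keep the
# combination with the best 5-fold cross-validation score.
X, y = load_breast_cancer(return_X_y=True)
param_grid = {"n_estimators": [100, 300], "max_depth": [None, 5, 10]}
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```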