Random Forest Algorithm (for Crab Age Prediction)
How it Works: Random Forest is an ensemble learning algorithm that builds many decision trees. Each tree is trained
on a random bootstrap sample of the data and considers a random subset of the features at each split. For regression
tasks such as predicting the age of crabs, the forest averages the predictions of all trees.
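The mechanism described above can be sketched by hand with individual decision trees. This is a minimal illustration, not the library's internal implementation: the data is synthetic (three made-up features standing in for crab measurements), and the number of trees and the max_features value are arbitrary assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for crab measurements (hypothetical data, not a real dataset)
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(200, 3))
y = 5.0 + 10.0 * X[:, 0] + 4.0 * X[:, 2] ** 2 + rng.normal(0.0, 0.3, size=200)

n_trees = 25  # arbitrary choice for the illustration
trees = []
for i in range(n_trees):
    # Bootstrap sample: draw as many rows as the dataset has, with replacement
    idx = rng.integers(0, len(X), size=len(X))
    # max_features=2 makes each split consider a random subset of 2 of the 3 features
    tree = DecisionTreeRegressor(max_features=2, random_state=i)
    tree.fit(X[idx], y[idx])
    trees.append(tree)

# Regression forest prediction: the mean of the individual tree predictions
X_new = np.array([[0.5, 0.5, 0.5]])
per_tree = np.array([t.predict(X_new)[0] for t in trees])
forest_prediction = per_tree.mean()

print(f"Per-tree predictions range from {per_tree.min():.2f} to {per_tree.max():.2f}")
print(f"Averaged forest prediction: {forest_prediction:.2f}")
```

Individual trees disagree because each one saw a different sample; averaging them smooths out that disagreement, which is the source of the ensemble's stability.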
Steps:
1. Collect Data: Gather crab data (e.g., size, weight, shell dimensions) and their ages.
2. Preprocess Data: Handle missing data and split the data into training and testing sets.
3. Train Model: Build a Random Forest model using the training data.
4. Evaluate: Use metrics like Mean Absolute Error (MAE) and R² to assess the model’s performance.
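The two metrics named in step 4 can be computed by hand to see what they measure. The sketch below uses a handful of made-up true and predicted ages (illustrative values only, not results from any model): MAE is the average absolute difference, and R² is one minus the ratio of the residual sum of squares to the total sum of squares around the mean.

```python
import numpy as np

# Hypothetical true and predicted crab ages (illustrative values only)
y_true = np.array([9.0, 6.0, 11.0, 8.0, 10.0])
y_pred = np.array([8.0, 7.0, 10.0, 8.0, 12.0])

# Mean Absolute Error: average size of the prediction error, in the target's units
mae = np.mean(np.abs(y_true - y_pred))

# R²: share of the target's variance that the predictions explain
ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
r2 = 1.0 - ss_res / ss_tot

print(f"MAE: {mae:.2f}")
print(f"R²: {r2:.3f}")
```

MAE is read in the same units as the age, so it is easy to interpret directly, while R² is unitless and compares the model against always predicting the mean age.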
Advantages:
Can capture complex, non-linear relationships.
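Less prone to overfitting than a single decision tree, because averaging many trees reduces variance. Some
implementations can also handle missing data, though imputing or dropping it during preprocessing is the safer default.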
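The claim about non-linear relationships can be checked on a toy problem. The sketch below is an illustration under stated assumptions: the data is synthetic (a noisy quadratic, not crab data), and the linear model is included only as a baseline for comparison.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic non-linear relationship: y depends on x squared, plus noise
rng = np.random.default_rng(42)
X = rng.uniform(-3.0, 3.0, size=(400, 1))
y = X[:, 0] ** 2 + rng.normal(0.0, 0.2, size=400)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# A straight line cannot follow the U-shaped curve; a forest of trees can
linear = LinearRegression().fit(X_train, y_train)
forest = RandomForestRegressor(n_estimators=100, random_state=42).fit(X_train, y_train)

# .score() returns R² on the held-out data for both regressors
linear_r2 = linear.score(X_test, y_test)
forest_r2 = forest.score(X_test, y_test)

print(f"Linear regression R²: {linear_r2:.2f}")
print(f"Random Forest R²:     {forest_r2:.2f}")
```

On data like this the forest should score far higher than the linear baseline, since each tree approximates the curve with piecewise-constant regions.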
CODE
# Import necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, r2_score
# Load your dataset (replace 'your_dataset.csv' with the actual file path)
dataset = pd.read_csv('your_dataset.csv')
# Handle missing data (simplest option: drop rows with missing values)
dataset = dataset.dropna()
# Assume the last column is the target variable (crab age)
X = dataset.iloc[:, :-1] # Features
y = dataset.iloc[:, -1] # Target variable
# One-hot encode any categorical feature columns; numeric columns pass through unchanged.
# No feature scaling is applied: tree-based models are insensitive to feature scale,
# and fitting a scaler on the full dataset before splitting would leak test information.
X = pd.get_dummies(X)
# Split the data into training and testing sets (80% training, 20% testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create the Random Forest regressor (age is a numeric target, so this is a regression task)
model = RandomForestRegressor(random_state=42)
# Hyperparameter tuning using GridSearchCV to find the best parameters
param_grid = {
    'n_estimators': [50, 100, 200], # Number of trees
    'max_depth': [None, 10, 20, 30], # Maximum depth of trees
    'min_samples_split': [2, 5, 10], # Minimum samples required to split a node
    'min_samples_leaf': [1, 2, 4], # Minimum samples required at a leaf node
    'bootstrap': [True, False] # Whether each tree is trained on a bootstrap sample
}
# Set up GridSearchCV with 5-fold cross-validation, selecting parameters by MAE
grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=5,
                           scoring='neg_mean_absolute_error', n_jobs=-1, verbose=2)
# Fit the GridSearchCV model on the training data
grid_search.fit(X_train, y_train)
# Get the best parameters from the grid search
best_params = grid_search.best_params_
print(f"Best Hyperparameters: {best_params}")
# best_estimator_ has already been refit on the full training set with the best parameters
best_model = grid_search.best_estimator_
# Predict on the test set
y_pred = best_model.predict(X_test)
# Evaluate the model with the regression metrics named in step 4
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"Mean Absolute Error: {mae:.2f}")
print(f"R² score: {r2:.2f}")
# Perform cross-validation on the training data to assess the model's stability
cv_scores = cross_val_score(best_model, X_train, y_train, cv=5, scoring='neg_mean_absolute_error')
cv_mae = -cv_scores # Scores are negated MAE, so flip the sign
print(f"Cross-Validation MAE: {cv_mae.mean():.2f} ± {cv_mae.std():.2f}")
Reported output (classification accuracy, from a classifier-based run of this script):
Accuracy of Random Forest model: 80.00%