Machine Learning for Sustainable Development Goal 6: Clean Water and
Sanitation
1. Introduction
Project Objective: To use machine learning to address challenges in clean water and
sanitation, aiming to support SDG 6 by predicting water quality, identifying contamination
sources, and forecasting water demand in under-resourced areas.
Motivation: Access to clean water is essential for health and well-being. By utilizing machine
learning, we aim to create predictive tools that can support resource allocation,
maintenance, and sanitation efforts.
2. Data Collection
Data Source: Kaggle Dataset (e.g., “Water Quality Dataset” or “Drinking Water Quality
Dataset”)
Dataset Description:
- Features: pH, hardness, solids, chloramines, sulfate, organic carbon, trihalomethanes,
turbidity, and water quality labels.
- Size: X rows by Y columns
- Target Variable: Water Quality (binary/multiclass)
3. Exploratory Data Analysis (EDA)
Summary Statistics: Mean, median, and distribution of each feature.
Visualizations:
- Correlation heatmap to understand relationships between variables.
- Boxplots for outlier detection.
- Histograms to assess the distribution of each variable.
Insights: Key trends or anomalies in pH levels, hardness, or contamination levels.
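The EDA steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual analysis: the DataFrame `df` is a synthetic stand-in with made-up values, and only three of the listed features are included. It computes the numbers behind the summary table, the correlation heatmap, and the boxplot outlier rule, and leaves the plotting itself out.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the water quality table (hypothetical values,
# not the Kaggle data); column names follow the feature list in Section 2.
rng = np.random.default_rng(42)
n = 200
df = pd.DataFrame({
    "ph": rng.normal(7.0, 1.5, n),
    "hardness": rng.normal(196.0, 33.0, n),
    "turbidity": rng.normal(4.0, 0.8, n),
})
# Blank out a few pH readings to mimic gaps in real measurements.
df.loc[rng.choice(n, size=10, replace=False), "ph"] = np.nan

# Summary statistics: describe() reports the mean, and its 50% row is the median.
summary = df.describe().T
summary["missing"] = df.isna().sum()

# Pairwise Pearson correlations (the values a correlation heatmap would display).
corr = df.corr()

# IQR rule, the same one boxplot whiskers use: flag values more than
# 1.5 * IQR below the first quartile or above the third quartile.
q1, q3 = df.quantile(0.25), df.quantile(0.75)
iqr = q3 - q1
outlier_mask = (df < q1 - 1.5 * iqr) | (df > q3 + 1.5 * iqr)
outlier_counts = outlier_mask.sum()

print(summary[["mean", "50%", "std", "missing"]])
print(corr.round(2))
print(outlier_counts)
```

With the real dataset, `df` would instead be loaded from the downloaded CSV, and the same three tables would feed the heatmap, boxplots, and histograms.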
4. Data Preprocessing
Handling Missing Values: Used median imputation for features with missing values.
Encoding Categorical Variables: One-hot encoding for any categorical features.
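Feature Scaling: Standardized features using `StandardScaler`. Scaling matters most for
scale-sensitive models such as Logistic Regression and SVM; tree-based models such as
Random Forest are largely unaffected by it.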
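One way to combine the three preprocessing steps is a Scikit-Learn `ColumnTransformer`. The sketch below is an assumption about how they could be wired together, not the project's code: the toy table `raw` and its categorical column `source_type` are invented for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy table: "source_type" is an invented categorical column
# used only to illustrate one-hot encoding.
raw = pd.DataFrame({
    "ph": [7.1, np.nan, 6.4, 8.2, 7.7, 5.9],
    "turbidity": [3.9, 4.4, np.nan, 2.8, 5.1, 4.0],
    "source_type": ["well", "river", "well", "tap", "river", "tap"],
})
numeric_cols = ["ph", "turbidity"]
categorical_cols = ["source_type"]

# Numeric columns: median imputation first, then standardization.
numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
# sparse_threshold=0 forces a dense NumPy array as output.
preprocess = ColumnTransformer(
    [
        ("num", numeric_steps, numeric_cols),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ],
    sparse_threshold=0,
)

# Fit on the training rows only, then reuse the fitted medians, means,
# and standard deviations on unseen rows.
train_rows, new_rows = raw.iloc[:4], raw.iloc[4:]
train_matrix = preprocess.fit_transform(train_rows)
new_matrix = preprocess.transform(new_rows)
print(train_matrix.shape, new_matrix.shape)
```

Fitting the transformer on training rows only keeps test-set statistics out of the imputation and scaling steps.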
5. Machine Learning Model Selection
Model Choices:
- Logistic Regression (for binary classification).
- Random Forest Classifier (for handling non-linear relationships and feature importance).
- Support Vector Machine (SVM) (for maximum-margin separation between classes).
Why Scikit-Learn: Simple and consistent API, a wide range of algorithms, and built-in
evaluation metrics.
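Evaluation Metrics: Accuracy, Precision, Recall, and F1-Score. Accuracy alone can mislead
when classes are imbalanced, and missing a contaminated sample (a false negative) is the
costlier error, so precision and recall are reported alongside it.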
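A comparison of the three candidate models on the four metrics could look like the sketch below. It assumes a binary target and uses synthetic data from `make_classification` as a stand-in for the water quality table, so the printed scores say nothing about the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic binary data standing in for the real table (assumption):
# 8 columns to mirror the feature list in Section 2.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=5, random_state=42
)

# Scaling is placed inside the pipelines of the two scale-sensitive models.
candidates = {
    "logistic_regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "svm": make_pipeline(StandardScaler(), SVC(kernel="rbf")),
}
metrics = ["accuracy", "precision", "recall", "f1"]

# 5-fold cross-validation; average each metric over the folds.
results = {}
for name, model in candidates.items():
    scores = cross_validate(model, X_demo, y_demo, cv=5, scoring=metrics)
    results[name] = {m: scores[f"test_{m}"].mean() for m in metrics}

for name, row in results.items():
    print(name, {m: round(v, 3) for m, v in row.items()})
```

For a multiclass target, the `precision`, `recall`, and `f1` scorer names would need averaged variants such as `f1_macro`.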
6. Model Implementation
Data Splitting: Split dataset into 80% training and 20% testing sets using `train_test_split`
from Scikit-Learn.
Hyperparameter Tuning:
- Used GridSearchCV for Random Forest to identify the best number of estimators and
maximum depth.
- Used 5-fold cross-validation within the search, so each hyperparameter setting is scored
on held-out folds rather than on the data it was trained on.
Code Example:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import classification_report

# X (feature matrix) and y (labels) are assumed to come from the preprocessing in Section 4
# Splitting the data: 80% training, 20% testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Hyperparameter tuning for Random Forest
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [10, 20, 30]
}
rf = RandomForestClassifier(random_state=42)
# scoring='f1' assumes a binary target; a multiclass target needs e.g. 'f1_macro'
grid_search = GridSearchCV(estimator=rf, param_grid=param_grid, cv=5, scoring='f1')
grid_search.fit(X_train, y_train)

# Best model and evaluation on the held-out test set
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)
print(classification_report(y_test, y_pred))
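One possible refinement of the example above: if imputation is fitted on the full dataset before splitting, statistics from the test rows leak into training. A hedged sketch of an alternative is to place the imputer and the classifier in a single `Pipeline` and tune that, so the medians are recomputed inside each cross-validation fold. The data here is synthetic (`make_classification` with values blanked out at random), the grid is reduced to keep the run short, and the stratified split is an added assumption rather than part of the original setup.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.metrics import classification_report, f1_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline

# Synthetic stand-in data (assumption), with about 5% of values blanked out.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, n_informative=5, random_state=42
)
rng = np.random.default_rng(0)
X_demo[rng.random(X_demo.shape) < 0.05] = np.nan

# stratify keeps the class ratio the same in the training and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("rf", RandomForestClassifier(random_state=42)),
])
# The "rf__" prefix routes each grid entry to the pipeline step named "rf".
param_grid = {"rf__n_estimators": [50, 100], "rf__max_depth": [10, 20]}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="f1")
search.fit(X_train, y_train)

# The fitted search predicts with the best pipeline, refit on all training rows.
y_pred = search.predict(X_test)
test_f1 = f1_score(y_test, y_pred)
print(search.best_params_)
print(classification_report(y_test, y_pred))
```

Because the test rows pass through the imputer fitted on training data only, the reported test scores are a cleaner estimate of performance on unseen samples.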
7. Results and Evaluation
Model Performance:
- Random Forest achieved an accuracy of X%, F1-score of Y%, and precision/recall values
indicating the model’s strength in predicting contamination risk.
Feature Importance:
- Insights into which features (e.g., pH, turbidity, chloramines) contribute most to water
quality predictions.
Confusion Matrix: Visualized true vs. predicted values to identify common
misclassifications.
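Feature importances and the confusion matrix can be extracted from a fitted Random Forest roughly as follows. This is an illustrative sketch on synthetic data: the feature names are attached to random columns purely as labels, so the resulting ranking carries no information about real water chemistry.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Labels borrowed from the feature list in Section 2; the data is synthetic.
feature_names = ["ph", "hardness", "solids", "chloramines", "sulfate",
                 "organic_carbon", "trihalomethanes", "turbidity"]
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, n_informative=5, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Impurity-based importances; they sum to 1 across all features.
importances = pd.Series(
    model.feature_importances_, index=feature_names
).sort_values(ascending=False)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, model.predict(X_test))
tn, fp, fn, tp = cm.ravel()

print(importances.round(3))
print(cm)
print(f"false negatives: {fn}, false positives: {fp}")
```

Reading the false negatives and false positives separately shows which kind of misclassification dominates, which the aggregate accuracy figure hides.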
8. Conclusion and Future Work
Key Takeaways: Machine learning models can predict water quality from chemical and
physical properties. The project demonstrates potential for real-time monitoring and
resource allocation.
Future Improvements:
- Incorporating real-time data for continuous learning.
- Expanding to a broader dataset covering multiple regions.
- Implementing models on edge devices for on-site analysis in remote areas.
9. References
- Kaggle Dataset
- Scikit-Learn Documentation