Train-Test-Validation Split: Complete ML Interview Guide
1. Core Definitions
Training Set
Purpose: Data used to train/fit the model parameters
What happens: Model learns patterns, weights, coefficients
Typical size: 60-80% of total data
Validation Set
Purpose: Data used for model selection, hyperparameter tuning, and preventing overfitting
What happens: Evaluate different models/hyperparameters during development
Typical size: 10-20% of total data
Key point: Never used for training, only for evaluation during development
Test Set
Purpose: Final, unbiased evaluation of model performance
What happens: Simulate real-world performance on completely unseen data
Typical size: 10-20% of total data
Critical rule: Only used ONCE at the very end
2. Common Split Ratios
| Split Type | Training | Validation | Test | When to Use |
|------------|----------|------------|------|-------------|
| 70-15-15   | 70%      | 15%        | 15%  | Standard for medium datasets |
| 80-10-10   | 80%      | 10%        | 10%  | When you have limited data |
| 60-20-20   | 60%      | 20%        | 20%  | When extensive validation is needed |
| 80-0-20    | 80%      | 0%         | 20%  | Simple train-test (no hyperparameter tuning) |
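A minimal sketch of a 70-15-15 split via two chained calls to scikit-learn's train_test_split (the toy data and exact sizes are illustrative placeholders):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Toy data standing in for your real X, y
X, y = make_classification(n_samples=1000, random_state=42)

# First carve off the 15% test set
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.15, random_state=42
)
# 0.15 / 0.85 of the remainder gives 15% of the original data as validation
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.15 / 0.85, random_state=42
)
```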
3. Why We Need Each Split
Why Not Just Train-Test?
❌ Problem: If you tune hyperparameters on the test set, you're indirectly training on it
Test performance becomes overly optimistic
Model won't generalize to truly new data
✅ Solution: Use validation set for all model development decisions
The "Data Leakage" Problem
Training: Learn patterns
Validation: Select best model/hyperparameters
Test: Get honest performance estimate
4. Step-by-Step Process
1. Split data → Train | Validation | Test
2. Train multiple models on Training set
3. Evaluate all models on Validation set
4. Select best model based on Validation performance
5. ONLY THEN evaluate final model on Test set
6. Report Test performance as final result
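A minimal sketch of this workflow with scikit-learn; the two candidate models, the split sizes, and the accuracy metric are illustrative choices, not prescriptions:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)

# Step 1: split into train / validation / test
X_trainval, X_test, y_trainval, y_test = train_test_split(X, y, test_size=0.15, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.15 / 0.85, random_state=0
)

# Steps 2-4: train candidates, compare on the validation set, pick the best
candidates = {
    "logreg": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=0),
}
val_scores = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    val_scores[name] = accuracy_score(y_val, model.predict(X_val))
best_name = max(val_scores, key=val_scores.get)

# Steps 5-6: a single, final evaluation on the untouched test set
best_model = candidates[best_name]
test_score = accuracy_score(y_test, best_model.predict(X_test))
print(best_name, val_scores[best_name], test_score)
```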
5. Cross-Validation vs Simple Split
Simple Train-Validation-Test Split
Pros: Fast, simple, mimics real deployment
Cons: Validation results depend on random split
K-Fold Cross-Validation
Process: Split training data into K folds, use K-1 for training, 1 for validation, repeat K times
Pros: More robust validation, uses all data for both training and validation
Cons: Computationally expensive (K times more training)
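A minimal K-fold sketch using scikit-learn's cross_val_score; the model and K=5 are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, random_state=0)
model = LogisticRegression(max_iter=1000)

# 5-fold CV: each fold serves as the validation set exactly once
scores = cross_val_score(model, X, y, cv=5)
print(scores.mean(), scores.std())
```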
Nested Cross-Validation
Outer loop: For final performance estimation (replaces test set)
Inner loop: For hyperparameter tuning (replaces validation set)
Use case: When data is very limited
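One common way to sketch nested CV in scikit-learn is to wrap a GridSearchCV (inner loop) inside cross_val_score (outer loop); the estimator and parameter grid below are illustrative assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, random_state=0)

# Inner loop: hyperparameter tuning; outer loop: unbiased performance estimation
inner = GridSearchCV(SVC(), param_grid={"C": [0.1, 1, 10]}, cv=3)
outer_scores = cross_val_score(inner, X, y, cv=5)
print(outer_scores.mean())
```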
6. Common Interview Questions & Answers
Q: "What happens if you tune hyperparameters on the test set?"
A: You introduce data leakage. The test set is no longer "unseen" - you've optimized for it. This leads to
overly optimistic performance estimates that won't hold in production.
Q: "How do you choose split ratios?"
A: Consider:
Dataset size: Smaller datasets need larger training portions
Model complexity: Complex models need more training data
Hyperparameter search space: Extensive tuning needs larger validation sets
Business requirements: How precise does your final estimate need to be?
Q: "What if your validation and test performance are very different?"
A: This suggests:
High variance: Your model is sensitive to data splits
Solution: Use cross-validation or stratified sampling
May indicate: Insufficient data or data distribution issues
7. Best Practices for Different Scenarios
Time Series Data
❌ Don't: Random split (breaks temporal order)
✅ Do: Chronological split
Past → Present → Future
Train → Validation → Test
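A minimal chronological-split sketch, here using scikit-learn's TimeSeriesSplit so every fold trains only on the past and validates on the future (toy data for illustration):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Toy time-indexed data; in practice X, y would be sorted by timestamp
X = np.arange(100).reshape(-1, 1)
y = np.arange(100)

tscv = TimeSeriesSplit(n_splits=5)
for train_idx, val_idx in tscv.split(X):
    # Training indices always precede validation indices
    print(train_idx.max(), "<", val_idx.min())
```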
Imbalanced Classes
❌ Don't: Random split (may create unbalanced splits)
✅ Do: Stratified split (maintains class proportions)
Small Datasets
Consider Leave-One-Out Cross-Validation
Use stratified sampling
Maybe skip separate test set, use cross-validation for everything
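A minimal Leave-One-Out sketch with scikit-learn; the Iris data and logistic regression model are illustrative stand-ins for a small dataset:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

X, y = load_iris(return_X_y=True)

# One fold per sample: maximal use of a small dataset, at high compute cost
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=LeaveOneOut())
print(scores.mean())
```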
Very Large Datasets
Can use smaller percentages for validation/test (e.g., 98-1-1)
Random splits are usually fine due to law of large numbers
8. Common Pitfalls (Interview Red Flags)
❌ Data Leakage
Using test set for any model development decisions
Feature selection on entire dataset before splitting
Preprocessing on entire dataset before splitting
❌ Temporal Leakage
Random splits on time series data
Using future information to predict past
❌ Target Leakage
Including features that wouldn't be available at prediction time
Features that are consequences of the target
9. Advanced Concepts
Holdout Validation
Simple train-validation-test split
Good for large datasets
Fast but less robust
Bootstrap Sampling
Sample with replacement from training data
Good for small datasets
Provides confidence intervals
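A minimal bootstrap sketch using sklearn.utils.resample to get a rough confidence interval for a statistic; the toy data and 1,000 resamples are illustrative:

```python
import numpy as np
from sklearn.utils import resample

values = np.arange(20)

# Draw repeated samples with replacement and record the statistic of interest
boot_means = [
    resample(values, replace=True, random_state=seed).mean()
    for seed in range(1000)
]
low, high = np.percentile(boot_means, [2.5, 97.5])
print(low, high)  # rough 95% confidence interval for the mean
```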
Group-Based Splits
When data points are grouped (e.g., by patient, by store)
Ensure same group doesn't appear in multiple splits
Prevents data leakage through group information
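A minimal group-aware split sketch using scikit-learn's GroupShuffleSplit; the toy group IDs are illustrative (GroupKFold works analogously for cross-validation):

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

X = np.arange(12).reshape(-1, 1)
y = np.array([0, 1] * 6)
groups = np.array([1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4])  # e.g. patient IDs

# All samples from a given group land entirely in train or entirely in test
splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups=groups))
print(groups[train_idx], groups[test_idx])
```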
10. Practical Implementation Tips
Reproducibility
```python
# Always set a random seed so splits are repeatable
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
```
Stratification
```python
# For classification, stratify to maintain class balance across splits
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)
```
Pipeline Approach
1. Split data first (before any preprocessing)
2. Fit preprocessing on training data only
3. Apply same preprocessing to validation/test
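A minimal sketch of this order of operations using a scikit-learn Pipeline, which fits the scaler on training data only and reuses it at prediction time; the scaler and model choices are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, random_state=42)

# Split first, before any preprocessing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# The scaler is fit inside the pipeline on X_train only; X_test is only transformed
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X_train, y_train)
print(pipe.score(X_test, y_test))
```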
11. Key Takeaways for Interviews
1. Purpose: Each split has a specific role - don't mix them up
2. Order: Always split first, then preprocess
3. Test set: Use only once, at the very end
4. Validation: Use for all model development decisions
5. Cross-validation: More robust but computationally expensive
6. Domain-specific: Time series and grouped data need special handling
7. Data leakage: The biggest sin in ML - avoid at all costs
12. Quick Mental Framework
Remember the "Three Questions":
1. Training: "How do I learn?"
2. Validation: "Which version of me is best?"
3. Test: "How well will I actually perform?"
Each dataset answers exactly one of these questions!