Missing data is a ubiquitous challenge in data analysis and machine learning. Data sets rarely arrive perfectly clean. They often have missing values, whether due to human error, equipment malfunction, privacy concerns or survey non-response. Missing values can severely bias model training, reduce statistical power and complicate data processing pipelines.
Dealing with missing data, often referred to as data imputation, is not just a technical process, but also a critical decision that can influence the final conclusions drawn from an analysis. Effective imputation balances the need to preserve the underlying structure of the data with the risk of introducing bias or artificial variability.
What Is Data Imputation?
Data imputation is a general approach to handling missing values in statistical and machine learning modeling. During imputation, we fill in the missing entries with appropriate values so that downstream analysis can proceed on a complete data set.
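As a minimal illustration of the idea (the column names here are hypothetical), here is mean imputation of a toy data set in pandas:

```python
import numpy as np
import pandas as pd

# A toy data set with missing entries in two numeric columns.
df = pd.DataFrame({
    "age": [25, np.nan, 34, 29],
    "income": [48_000, 52_000, np.nan, 61_000],
})

# Impute each column's missing entries with that column's mean.
df_imputed = df.fillna(df.mean(numeric_only=True))
print(df_imputed)
```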
Types of Missing Data
Understanding why data is missing is the first step toward effective imputation. Statisticians generally categorize missing data into three types:
Missing Completely at Random (MCAR)
The probability of a data point being missing is unrelated to both the observed and unobserved data. If a stray ink smudge obscures a random number on a printed data sheet, that is MCAR.
Missing at Random (MAR)
The probability of a data point being missing is related to the observed data, but not to the missing data itself. For example, men might be less likely to report their weight, but the reason for its absence depends on the observed variable (gender), not the value of weight itself.
Missing Not at Random (MNAR)
The probability of a data point being missing is related to the missing value itself. For instance, people with extremely high incomes might be less likely to report their salary.
Most imputation techniques assume MAR or MCAR. MNAR is the most challenging and often requires modeling the missing data mechanism itself.
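To make the three mechanisms concrete, the sketch below simulates each one on a synthetic income variable; the covariate, rates and thresholds are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000
gender = rng.integers(0, 2, size=n)                  # observed covariate
income = rng.lognormal(mean=10, sigma=0.5, size=n)   # variable that goes missing

# MCAR: every value has the same 20% chance of going missing.
mcar_mask = rng.random(n) < 0.2

# MAR: missingness depends only on the observed covariate (gender).
mar_mask = rng.random(n) < np.where(gender == 1, 0.35, 0.05)

# MNAR: missingness depends on the (unobserved) income value itself.
mnar_mask = rng.random(n) < np.where(income > np.quantile(income, 0.8), 0.5, 0.05)

for name, mask in [("MCAR", mcar_mask), ("MAR", mar_mask), ("MNAR", mnar_mask)]:
    print(name, f"fraction missing: {mask.mean():.2f}")
```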
The Need for Data Imputation
Many state-of-the-art, off-the-shelf machine learning algorithms (e.g., XGBoost, LightGBM) natively handle missing values as part of the training process. When using these algorithms, data imputation is not strictly necessary, though it might still be desirable. Beyond the missing-data mechanisms outlined above, the nature of the domain or use case often determines whether imputation is worth the effort.
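For example, XGBoost accepts NaN feature values directly and learns a default branch direction for them at each tree split. A minimal sketch on synthetic data (assuming the xgboost package is installed):

```python
import numpy as np
from xgboost import XGBRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = X[:, 0] * 2 + rng.normal(scale=0.1, size=500)

# Punch random holes in the features; no imputation step is needed.
X[rng.random(X.shape) < 0.2] = np.nan

model = XGBRegressor(n_estimators=100)
model.fit(X, y)              # NaNs are routed down learned default branches
print(model.predict(X[:5]))
```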
Common Imputation Techniques
The most common methods, each trading off simplicity against statistical fidelity, include the following:

- Mean/median/mode imputation: replace each missing entry with a simple summary statistic of its column. Fast and easy, but it shrinks variance and can distort the variable's distribution.
- K-nearest neighbors (KNN) imputation: fill each missing entry using the most similar complete rows. Captures local structure, but scales poorly to large data sets.
- Regression (model-based) imputation: predict each missing value from the other variables. Data-driven, but a single deterministic prediction understates uncertainty.
- Multiple imputation (e.g., MICE): create several completed data sets and pool the results. Often the statistical gold standard, at the cost of extra computation.
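Several of these are available off the shelf; a minimal scikit-learn sketch of the simple and KNN variants on synthetic data:

```python
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer

X = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, np.nan], [4.0, 5.0]])

# Mean imputation: each NaN becomes its column's mean.
print(SimpleImputer(strategy="mean").fit_transform(X))

# KNN imputation: each NaN becomes the average of that feature
# over the k most similar rows.
print(KNNImputer(n_neighbors=2).fit_transform(X))
```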
Advanced Imputation Strategies
For more complex data sets, researchers turn to advanced techniques.
Deep Learning Models
Generative adversarial networks (GANs) and variational autoencoders (VAEs) can learn the underlying data distribution and generate synthetic values for missing entries. This is particularly effective for high-dimensional, non-linear data.
Expectation-Maximization (EM) Algorithm
An iterative statistical method that alternates between using the current parameter estimates to compute expected values for the missing entries (the expectation step) and re-estimating the model parameters from the completed data (the maximization step), repeating until convergence.
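Below is a minimal sketch of this idea for data assumed to follow a multivariate Gaussian distribution; the function and variable names are illustrative, and the estimated covariance is assumed non-singular:

```python
import numpy as np

def em_impute(X, n_iter=50, tol=1e-6):
    """EM-style imputation for a matrix X with np.nan marking missing entries."""
    X = np.asarray(X, dtype=float).copy()
    missing = np.isnan(X)
    # Initialize missing entries with their column means.
    col_means = np.nanmean(X, axis=0)
    X[missing] = np.take(col_means, np.where(missing)[1])

    for _ in range(n_iter):
        X_old = X.copy()
        # M-step: re-estimate mean and covariance from the completed data.
        mu = X.mean(axis=0)
        sigma = np.cov(X, rowvar=False)
        # E-step: replace each row's missing entries with their conditional
        # expectation given that row's observed entries:
        #   E[x_m | x_o] = mu_m + S_mo @ S_oo^{-1} @ (x_o - mu_o)
        for i in range(X.shape[0]):
            m = missing[i]
            if not m.any():
                continue
            o = ~m
            if not o.any():          # fully missing row: fall back to the mean
                X[i] = mu
                continue
            S_oo = sigma[np.ix_(o, o)]
            S_mo = sigma[np.ix_(m, o)]
            X[i, m] = mu[m] + S_mo @ np.linalg.solve(S_oo, X[i, o] - mu[o])
        if np.max(np.abs(X - X_old)) < tol:
            break
    return X
```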
Deletion Methods
While not imputation, methods like listwise deletion (removing any row with a missing value) or pairwise deletion (using all available data for a specific analysis) are often compared against imputation. These are generally discouraged as they lead to a significant loss of information and potential bias, especially if the data is not MCAR.
In particular, rows containing missing values may still carry useful information, and in the extreme case where a majority of rows have at least one missing value, deletion discards most of the data set.
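For reference, listwise deletion is a one-liner in pandas, which also makes it easy to quantify how much data it throws away:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, np.nan, 3, 4], "b": [np.nan, 2, 3, np.nan]})

kept = df.dropna()  # listwise deletion: drop any row with a missing value
print(f"rows kept: {len(kept)} of {len(df)}")  # here only 1 of 4 rows survives
```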
Bayesian Methods
Bayesian methods formulate a probabilistic model of the variables with missing values, yielding a full posterior distribution over the missing entries rather than a single point estimate. For instance, drawing imputations from such a posterior can enhance the MICE approach.
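scikit-learn's IterativeImputer implements a MICE-style procedure; setting sample_posterior=True makes it draw each imputation from the posterior of a Bayesian ridge regression rather than using a point prediction. A sketch on synthetic data:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
X[rng.random(X.shape) < 0.15] = np.nan

# sample_posterior=True uses the default BayesianRidge estimator and samples
# each imputed value from its posterior predictive distribution.
imputer = IterativeImputer(sample_posterior=True, random_state=0)
X_imputed = imputer.fit_transform(X)
```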
Practical Recommendations for Data Imputation
When faced with missing data, consider the following best practices:
1. Visualize and Analyze
Plot the missing data patterns. Is it confined to one column? Does the missingness correlate with another variable? Use this analysis to hypothesize the type of missingness (MCAR, MAR, MNAR).
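A quick pandas-only version of this diagnostic, on synthetic data with a deliberately MAR pattern (the column names are hypothetical):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age": rng.integers(20, 70, size=200),
    "income": rng.normal(60_000, 15_000, size=200),
})
# Make income go missing more often for younger rows (a MAR pattern).
df.loc[(df["age"] < 35) & (rng.random(200) < 0.4), "income"] = np.nan

# Share of missing values per column.
print(df.isna().mean())

# Does missingness correlate with another variable? Compare "age" where
# "income" is missing vs. present; a clear gap suggests MAR rather than MCAR.
print(df.groupby(df["income"].isna())["age"].mean())
```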
2. Decide to Impute Data
As mentioned above, data imputation might not be necessary when using certain machine learning algorithms. At this point, decide explicitly whether to impute. If you choose to let the algorithm handle missing values natively, you do not need to proceed further, which can save considerable effort.
3. Start Simple, Then Iterate
Begin with simple methods like mean/median/mode, especially for preliminary analyses. If these methods prove inadequate (e.g., they drastically change the variable’s distribution), move to more sophisticated techniques.
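One quick adequacy check: mean imputation preserves the sample mean but shrinks the variance, which you can verify directly on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=50, scale=10, size=1_000)
x[rng.random(x.size) < 0.3] = np.nan  # 30% missing

x_imputed = np.where(np.isnan(x), np.nanmean(x), x)

print(f"std before: {np.nanstd(x):.2f}")     # computed on observed values only
print(f"std after:  {x_imputed.std():.2f}")  # noticeably smaller
```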
4. Use Robust Techniques
For final model deployment, multiple imputation is often the gold standard, as it provides the most statistically sound estimates and uncertainty measures. For predictive modeling, KNN imputation and model-based imputation are strong alternatives, because model-based imputation relies more heavily on the characteristics seen in the data (i.e., it is more data-driven).
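One common way to approximate multiple imputation with scikit-learn is to run IterativeImputer several times with different seeds and pool the downstream estimates. The sketch below pools a simple statistic as a stand-in for whatever analysis you actually run (full multiple imputation would combine results via Rubin's rules):

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
X[rng.random(X.shape) < 0.2] = np.nan

# Draw m completed data sets, estimate a quantity on each, then pool.
estimates = []
for seed in range(5):
    imputer = IterativeImputer(sample_posterior=True, random_state=seed)
    X_complete = imputer.fit_transform(X)
    estimates.append(X_complete[:, 0].mean())  # stand-in for any analysis

print(f"pooled estimate: {np.mean(estimates):.3f} +/- {np.std(estimates):.3f}")
```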
5. Impute the Training Set Only
Crucially, the imputation model or statistics (like the mean) should be derived only from the training data to prevent data leakage. The same statistics are then applied to the validation and test sets.
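In scikit-learn terms, that means fitting the imputer on the training split only and reusing the fitted object on every other split:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
X[rng.random(X.shape) < 0.1] = np.nan

X_train, X_test = train_test_split(X, test_size=0.2, random_state=0)

imputer = SimpleImputer(strategy="median")
X_train_imp = imputer.fit_transform(X_train)  # statistics learned from train only
X_test_imp = imputer.transform(X_test)        # same statistics applied to test
```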
The AI Behind This Article
While drafting this article on data imputation, I used AI much like I do in my day-to-day work: as an efficient assistant and second set of eyes rather than a replacement for judgment. I wrote the core structure and examples myself, then used AI to stress-test the outline and tighten the explanations. I take the same approach at Coinbase, using AI to speed up experimentation and documentation so I can focus on the modeling decisions that determine how we handle missing data in production systems. If you’re interested in working at an AI-forward company, you can explore our open roles on Coinbase’s Built In profile.
Frequently Asked Questions
Can I use tree-based models (like XGBoost) instead of explicit imputation?
Yes, many tree-based models can handle missing values natively, which might be sufficient for the task at hand (see The Need for Data Imputation above). Still, consider the particular domain and confirm that native handling makes sense for your use case.
How do I handle imputation for categorical variables with high cardinality?
Categorical variables present specific challenges for imputation, especially when they have high cardinality: mode imputation can badly distort the frequency distribution, and similarity-based methods struggle with rare categories. Be especially careful here and let the domain guide what makes sense. If tree-based models capable of handling missing values can be used, it is usually best to offload this kind of missing value to the model.
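When imputation is still needed, one low-risk option is to treat missingness as its own category, which preserves the signal that the value was absent (the column name below is hypothetical):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"merchant_id": ["m_01", np.nan, "m_47", np.nan, "m_03"]})

# Treat missingness as an explicit category rather than guessing a value.
df["merchant_id"] = df["merchant_id"].fillna("__missing__")
print(df["merchant_id"].value_counts())
```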
