Imputation Methods for Missing Data

There comes a point when anyone who deals with data must face an ugly truth: the real world is a messy, sometimes nondeterministic place, and messy, missing data is something that will occupy at least half of your time. In this post, I outline a few statistical methods used for intelligently replacing (i.e. imputing) missing data, along with each method's advantages and disadvantages. Where available, I also mention Python libraries that implement these methods.

Missing Data

Before we go into imputation methods, we need to define what we mean by "missing". This is more nuanced than it first appears, so bear with me. There are no fewer than three types of missing data: missing at random, missing completely at random, and not missing at random.

Missing at Random (MAR)

Missing data of this type may depend on observed data. For example, in the following table we have three columns (this is for the sake of example; most real-world data would have many more, less intuitively named columns):

Name      | Retired | Date Retired
Skywalker | Yes     | 2022-03-01
Rey       | No      |
Solo      | Yes     | 2020-01-01

In the table above, the "missing" values in the Date Retired column depend directly on the Retired column: if someone has not retired, then there can be no Date Retired.

Missing Completely at Random (MCAR)

This is where the probability that a value in a given column is missing does not depend on any other column. This is the truest form of "random" missing data, and it tends to be the easiest to impute. A simple strategy for determining whether data is MAR or MCAR is to create a binary indicator column called "missing" and compute the linear correlation between that indicator and the other observed columns. If a correlation exists, the data is more likely to be MAR than MCAR.
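
As a quick illustration of that heuristic, the snippet below builds a binary "missing" indicator for one column and correlates it with another observed column. The data and column names are made up for the example.

```python
# Toy illustration of the MAR-vs-MCAR heuristic: build a binary "missing"
# indicator for one column and correlate it with another observed column.
import pandas as pd

df = pd.DataFrame({
    "age": [34, 51, None, 29, None, 62],
    "income": [40_000, 82_000, 55_000, 31_000, 47_000, 90_000],
})
df["age_missing"] = df["age"].isna().astype(int)

# A noticeable correlation with observed variables suggests MAR rather than MCAR.
print(df[["income", "age_missing"]].corr())
```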

Not Missing at Random (NMAR)

This is where the probability that a value is missing depends on the actual value of the attribute. For example, if it is company policy not to record the ages of persons older than 65 (again, an illustrative example), then any age column in your table/database would have missing data for persons over 65. In this case, the missingness depends directly on the range of values in the data column itself.

Imputation Methods

Hasan et al. 2021 split imputation methods into statistical and machine-learning-based categories. The sections below outline several of these methods, most of which fall into the statistical category:

Dropping Missing Values

This method is the most straightforward to implement: drop any row with a missing value (listwise deletion). It is the most brute-force way of reducing the missing-value count and comes with several drawbacks. Firstly, it reduces the sample size and can introduce bias into the data. Furthermore, even under MCAR, listwise deletion causes a loss of power when looking for correlations in the dataset. Since this method does not account for any relationships that may exist between the patterns of missing values, it is not recommended.
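
In pandas, listwise deletion is a one-liner; here is a minimal example on toy data:

```python
# Listwise deletion in pandas: drop every row that contains a missing value.
import pandas as pd

df = pd.DataFrame({"a": [1.0, None, 3.0], "b": [4.0, 5.0, None]})
complete_cases = df.dropna()               # keeps only fully observed rows
print(len(df), "->", len(complete_cases))  # 3 -> 1
```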

Mean/Median Filling

This can be the quickest way to obtain a decent reduction in the missing-value count of a dataset. It does not decrease the sample size, and it is not computationally expensive.

Although this is straightforward, this method comes with two main drawbacks:

  1. It reduces the natural variance in the column. This may result in overfitting if a machine-learning model is trained on the imputed data.
  2. The imputed value does not account for categorical variables, nor does it account for patterns in the data. This method may fill missing values with a number that never occurs in the original data (for example, a non-integer mean for an integer-valued column).
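
As a minimal sketch on a hypothetical column, median filling is a one-liner in pandas; scikit-learn's SimpleImputer (strategy="mean" or "median") does the same for whole numeric arrays.

```python
# Column-wise median filling with pandas.
import pandas as pd

df = pd.DataFrame({"height": [170.0, None, 182.0, 176.0]})
df["height"] = df["height"].fillna(df["height"].median())
```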

iDMI

This method was proposed by Rahman et al. 2014 and uses a decision tree to split the data. The primary assumption is as follows:

Correlations of attributes for records which belong to a leaf in the decision tree (as measured by the Gini coefficient) are higher than the correlations of attributes for records within the whole data set

The novel aspect of this method was imputation using the mean of a given leaf. In essence, the method first splits the dataset into similar groups (defined by the decision-tree leaves) using the Gini coefficient. The missing values are then imputed based on the leaf mean rather than the global mean of the dataset. This builds on an earlier EM-style approach, which works by iteratively estimating the mean and covariance matrices for the missing data as a function of the observed data until convergence.

iDMI assumes MCAR, but is orders of magnitude slower than other EM-type imputation techniques.
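
The snippet below is a rough sketch of the leaf-wise idea, not the authors' implementation: it uses scikit-learn's DecisionTreeRegressor to form leaves from records where the target is observed and fills missing targets with their leaf's mean. For simplicity it assumes the feature columns themselves are fully observed; all names are hypothetical.

```python
# A rough sketch of leaf-wise mean imputation in the spirit of iDMI, using
# scikit-learn's DecisionTreeRegressor as the splitter. Not the authors'
# implementation; assumes the feature columns are fully observed.
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

def leaf_mean_impute(df, target, features):
    """Fill missing values of `target` with the mean of their decision-tree leaf."""
    df = df.copy()
    known = df[target].notna()

    # Fit a shallow tree on records where the target is observed.
    tree = DecisionTreeRegressor(max_depth=3, random_state=0)
    tree.fit(df.loc[known, features], df.loc[known, target])

    # Assign every record to a leaf, then impute with that leaf's mean
    # (rather than the global column mean).
    df["_leaf"] = tree.apply(df[features])
    leaf_means = df.loc[known].groupby("_leaf")[target].mean()
    df.loc[~known, target] = df.loc[~known, "_leaf"].map(leaf_means)
    return df.drop(columns="_leaf")
```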

Hot-Deck Imputation

This method was discussed in Myers 2011, where the author referred to listwise deletion as "the worst possible method". Hot-deck imputation relaxes the typical assumptions of EM-type imputation methods and replaces missing values with values from a similar "donor" record in the dataset.

Hot-deck imputation sorts the data with respect to a given variable. Records with complete data that match records with missing data are "eligible" to donate their values (with some probability) to the records with missing data. After the data are sorted into decks, all records within a given deck are randomly shuffled, and random permutations of complete data are used to fill the missing data.

This method fills missing entries with observed values drawn by random sampling without replacement, so the imputed values fall within the expected domain, and no EM model needs to be built. The downside of this approach is that records with no suitable donor have no chance of being filled in, since the method relies directly on other records to impute the missing data.
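
Below is a small sketch of the donor idea, assuming the records have already been grouped into decks by some observed variable; the column and function names are hypothetical, and donors are sampled without replacement whenever enough of them exist.

```python
# A minimal hot-deck sketch: records are grouped into "decks" by an observed
# variable, and each missing value receives a randomly sampled donor value
# from the same deck.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

def hot_deck_impute(df, column, deck):
    df = df.copy()
    for _, idx in df.groupby(deck).groups.items():
        group = df.loc[idx, column]
        donors = group.dropna().to_numpy()
        missing_idx = group[group.isna()].index
        if len(donors) and len(missing_idx):
            # Sample without replacement when enough donors exist.
            take = rng.choice(donors, size=len(missing_idx),
                              replace=len(missing_idx) > len(donors))
            df.loc[missing_idx, column] = take
    return df
```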

Local Least Squares Imputation

This method attempts to find correlations amongst rows in the dataset, and missing values are imputed by a linear combination of the k nearest (most similar) variables. The optimal combination of these variables is found by least-squares regression, and the method was first described in Kim et al. 2005.

A similar way of thinking about this method is to consider all variables as candidates for regression. Initially, missing values are replaced by the column-wise mean. Over multiple iterations, the current estimate in a given column is updated using least squares until convergence.
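
That iterative view can be sketched with scikit-learn's IterativeImputer and a linear estimator: start from column means, then refine each column by least squares until convergence. This mirrors the idea but is not Kim et al.'s exact local least squares algorithm.

```python
# Iterative regression imputation with a linear estimator (a sketch of the
# fill-with-means-then-refine-by-least-squares idea).
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import LinearRegression

X = np.array([[1.0, 2.0, np.nan],
              [2.0, np.nan, 6.0],
              [3.0, 6.0, 9.0],
              [4.0, 8.0, 12.0]])

imputer = IterativeImputer(estimator=LinearRegression(),
                           initial_strategy="mean",
                           max_iter=10, random_state=0)
X_filled = imputer.fit_transform(X)
```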

MissForest

This method was proposed in Stekhoven et al. 2012 as a non-parametric missing-value method for mixed-type data. It uses a random forest, trained on the observed values of the dataset, to predict the missing values. The algorithm first fills the missing values with column-wise means and then sorts the variables by their amount of missing values, starting with the lowest amount.

For each variable, the missing values are imputed by first fitting a random forest to the fully observed part of the data (where that variable is not missing), and then predicting the missing values for the partially specified records. This procedure is iterated until some predefined stopping criterion is met, defined in the original paper in terms of the difference between the current and previous iterations across all variables.

This method has the advantage of not needing tuning parameters, and it is able to handle continuous and categorical variables. Furthermore, it scales easily to high-dimensional datasets where the number of variables may greatly exceed the number of observations.
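
For purely numeric data, a MissForest-style imputer can be approximated with scikit-learn's IterativeImputer using a random-forest estimator (the missingpy package also provides a dedicated MissForest implementation). The sketch below uses toy data and is an approximation, not the original algorithm.

```python
# Approximating MissForest for numeric data: IterativeImputer with a
# random-forest estimator.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

X = np.array([[7.0, 2.0, np.nan],
              [4.0, np.nan, 6.0],
              [10.0, 5.0, 9.0],
              [8.0, 8.0, 12.0]])

imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=100, random_state=0),
    max_iter=10, random_state=0)
X_filled = imputer.fit_transform(X)
```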