Categorical Encoding for Machine Learning

Why Encoding?

Supervised machine learning is the task of extracting latent relationships present in data. Typically, this involves a set of values (the input features/variables) and a given target (or set of targets). One of the most common families of algorithms is regression, whether linear (a continuous target) or logistic/multinomial (a discrete target or targets).

These algorithms result in a model defined in the following way: $$ Y = f(\alpha + \sum_{i=1}^{N} \beta_i x_i) $$ where $f$ is the identity function for linear regression, or the logistic (inverse-logit) function for logistic regression.

This poses a challenge for datasets whose attributes are not purely numeric. For example, if a dataset is defined as follows (this is the Employee Analysis dataset from Kaggle), the above equation fails to apply directly:

| Age | Attrition | BusinessTravel | DailyRate | Department | DistanceFromHome | Education | EducationField |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 41 | Yes | Travel_Rarely | 1102 | Sales | 1 | 2 | Life Sciences |
| 49 | No | Travel_Frequently | 279 | Research & Development | 8 | 1 | Life Sciences |
| 37 | Yes | Travel_Rarely | 1373 | Research & Development | 2 | 2 | Other |
| 33 | No | Travel_Frequently | 1392 | Research & Development | 3 | 4 | Life Sciences |
| 27 | No | Travel_Rarely | 591 | Research & Development | 2 | 1 | Medical |
| 32 | No | Travel_Frequently | 1005 | Research & Development | 2 | 2 | Life Sciences |
| 59 | No | Travel_Rarely | 1324 | Research & Development | 3 | 3 | Medical |
| 30 | No | Travel_Rarely | 1358 | Research & Development | 24 | 1 | Life Sciences |
| 38 | No | Travel_Frequently | 216 | Research & Development | 23 | 3 | Life Sciences |
| 36 | No | Travel_Rarely | 1299 | Research & Development | 27 | 3 | Medical |
| 35 | No | Travel_Rarely | 809 | Research & Development | 16 | 3 | Medical |

For example, how do we multiply Travel_Rarely in the BusinessTravel field by some \(\beta_j\) coefficient above? We can't multiply strings by real numbers!

This is a common issue: our target is affected not only by quantitative values, but also by qualitative (nominal-scale) values. Variables on a nominal scale have no direct numeric value; they are categories or classes. Nominal variables (such as "Attrition" and "BusinessTravel" above) have no natural order, but instead denote discrete categories. Ordinal variables do have an order of measurement and indicate direction (e.g. high is "more" than medium, which is "more" than low); the ordering exists, but the absolute value of any given class is not defined. Interval scales provide information about order and also possess equal intervals. A common example is the Celsius or Kelvin scale, where a degree represents the same temperature difference regardless of where it occurs on the scale.

Categorical Encoding

In order to use any of the types of variables mentioned above, some form of encoding is needed. This is the process of creating a vectorised numeric form for each class in an otherwise discrete set of classes, with some thought put into why each class is given a particular value.

The following outlines several ways in which this is done, along with pros and cons and the common libraries which implement these methods in Python.

One-Hot Encoding

From the very start, this is not recommended when the number of categories is large. One-hot encoding takes a single column of multiple categories and transforms it into multiple columns of single categories. For example, in our dataset above, the BusinessTravel field would be turned into two columns, BusinessTravel_rarely and BusinessTravel_frequently, where each column contains a 1 depending on what the original value was:

| BusinessTravel | BusinessTravel_rarely | BusinessTravel_frequently |
| --- | --- | --- |
| Travel_Rarely | 1 | 0 |
| Travel_Frequently | 0 | 1 |
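As a minimal sketch of this expansion (the DataFrame below is just an illustrative slice, not the full dataset), `pandas.get_dummies` produces one indicator column per category:

```python
import pandas as pd

# Toy slice of the BusinessTravel column (illustrative values only)
df = pd.DataFrame({"BusinessTravel": ["Travel_Rarely", "Travel_Frequently", "Travel_Rarely"]})

# One column per category, with a 1 wherever the original row held that category
one_hot = pd.get_dummies(df, columns=["BusinessTravel"], dtype=int)
print(one_hot)
```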

If we had 15 different categories in this one column, the dataset suddenly gains 15 extra columns. Scale this across multiple columns and the computational burden grows quickly. On top of that, the curse of dimensionality kicks in, causing very strange optimization behaviour in OLS models. Additionally, this method captures no morphological information about the distribution of the variable, which may otherwise have been informative in the overall learning/prediction process. Finally, no accommodation is made for categories in the test set which were unseen in the training set.

Count/Frequency Encoding

In this method, the discrete categories are replaced by their raw count or frequency of occurrence in the dataset. This avoids the drawback of introducing additional dimensions for our OLS to optimize within. However, categories which share no similarity but are distributed similarly may be difficult to distinguish. An extreme example is a column with two values (e.g. Hot and Cold), with approximately half of the dataset being "Hot" and the other half "Cold". In this case, both categories (semantically opposite) would be assigned a value close to \( 0.5 \), which makes no logical sense.
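A minimal sketch of both variants using pandas (the toy column below is made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({"BusinessTravel": ["Travel_Rarely", "Travel_Frequently",
                                      "Travel_Rarely", "Travel_Rarely"]})

# Frequency encoding: replace each category with its relative frequency
freq = df["BusinessTravel"].value_counts(normalize=True)
df["BusinessTravel_freq"] = df["BusinessTravel"].map(freq)

# Count encoding: replace each category with its raw count instead
counts = df["BusinessTravel"].value_counts()
df["BusinessTravel_count"] = df["BusinessTravel"].map(counts)
print(df)
```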

Mean Encoding

This is where the mean value of the target variable, conditioned on the given category, is used to replace the discrete categories. For example, if we have three values (say from the EducationField column above) and we are attempting to predict some binary target, we can take the mean value of the target variable for each category:

$$ \mathrm{enc}(x_k) = \frac{1}{M_k} \sum_{i:\, x_i = x_k} y_i $$

(where \( M_k \) is the number of rows belonging to category \( x_k \)). A significant advantage of this method is that the nature of the posterior probability of the target for each category is captured directly in the encoding.
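A minimal sketch, assuming a binary Attrition target and a toy slice of the EducationField column:

```python
import pandas as pd

df = pd.DataFrame({
    "EducationField": ["Life Sciences", "Medical", "Life Sciences", "Other", "Medical"],
    "Attrition":      [1, 0, 0, 1, 0],   # binary target
})

# Mean encoding: each category gets the mean target value over its rows
means = df.groupby("EducationField")["Attrition"].mean()
df["EducationField_enc"] = df["EducationField"].map(means)
print(df)
```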

Min-Hash Encoding

This method was proposed by Cerda et al. 2020 to address the shortcomings of one-hot and count encoding. Hashing refers to mapping one set of values into a fixed set of hashed values by means of some hashing function \( f \). Min-hash is a type of locality-sensitive hashing (LSH) technique which attempts to minimize the hash "distance" between more similar categorical values. This technique specifically addresses categorical variables with high cardinality (lots of different categories).

Without getting into the mathematical weeds: to build a pairwise similarity metric over \( n \) categories, on the order of \( n^2 \) comparisons are needed. Min-hashing is a type of LSH which attempts to band categories that appear similar more closely together prior to computing the final hash value, which reduces the number of comparisons needed to compute the hash distance between encoded values. However, as reported by Cerda et al. 2020, the ability to recover the initial categories with this technique was much lower than with other approaches such as the similarity encoder and the Gamma-Poisson encoding described next.
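The following is not the encoder from the paper, just a from-scratch sketch of the core idea: hash the character n-grams of each string with several seeded hash functions and keep the minimum per seed, so that strings sharing many n-grams tend to collide on more components.

```python
import hashlib

def ngrams(s, n=3):
    """Character n-grams of a string (the set being min-hashed)."""
    s = f" {s} "
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def minhash_signature(category, n_components=8, n=3):
    """One min-hash value per seeded hash function -> fixed-length vector."""
    grams = ngrams(category, n)
    signature = []
    for seed in range(n_components):
        # Seeded hash: prepend the seed so each component uses a different function
        hashes = [int(hashlib.md5(f"{seed}|{g}".encode()).hexdigest(), 16) for g in grams]
        signature.append(min(hashes))
    return signature

# Similar strings share many n-grams, so their signatures agree on more components
print(minhash_signature("Travel_Rarely"))
print(minhash_signature("Travel_Frequently"))
```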

Gamma-Poisson (GP) Encoding

The single-line summary of GP encoding is that it "estimates a decomposition of the string entries in terms of a linear combination of latent categories" (again from Cerda et al. 2020). GP encoding utilizes the Gamma-Poisson model (Canny 2004), which was initially proposed to find latent topics in collections of text. GP encoding instead takes the n-gram counts of each category (think: how often each short character sequence appears per category) and models them as a linear combination of latent topics or "prototypes", which are themselves estimated by factorizing the data.

Put more concisely, fitting the GP model amounts to minimizing the KL-divergence between the observed n-gram counts and their non-negative matrix factorization, with the activations subject to a Gamma prior conditioned on the values of the input feature.
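The sketch below is not the full GP estimator from the paper, only a rough approximation of the same idea using scikit-learn: count character n-grams per category, then factorize the count matrix with KL-divergence NMF so each category becomes a small vector of latent "prototype" activations. The component count and n-gram range are arbitrary choices here.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import NMF

categories = ["Life Sciences", "Medical", "Other", "Technical Degree", "Marketing"]

# Character n-gram counts per category string
vectorizer = CountVectorizer(analyzer="char_wb", ngram_range=(2, 4))
counts = vectorizer.fit_transform(categories)

# KL-divergence NMF approximates the Gamma-Poisson factorization:
# each category is expressed as a non-negative mix of latent "prototypes"
nmf = NMF(n_components=3, beta_loss="kullback-leibler", solver="mu",
          max_iter=500, init="nndsvda")
encoded = nmf.fit_transform(counts)
print(encoded)  # one 3-dimensional encoding per category
```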

Helmert Encoding

This was proposed in Hutcheson 2011, specifically for categories with some natural order. The mean of the target variable for one category (or level) is compared to the mean of the target over the remaining levels. For example, if our variable has three levels:

$$ B_0 = \frac{1}{3}(\bar{Y}_1+\bar{Y}_2+\bar{Y}_3) $$

$$ B_1 = \bar{Y}_1-\frac{1}{2}(\bar{Y}_2+\bar{Y}_3) $$

$$ B_2 = \bar{Y}_2 - \bar{Y}_3 $$

The main advantage of Helmert encoding is its computational ease: finding means of target variables is a relatively simple operation that is easily vectorized.
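As a quick sketch, the quantities above can be computed directly from the per-level target means; the column names and data below are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Education": [1, 1, 2, 2, 3, 3],   # ordered levels
    "Attrition": [1, 0, 0, 0, 1, 1],   # target
})

# Per-level target means, in the natural order of the levels
level_means = df.groupby("Education")["Attrition"].mean().sort_index()
y1, y2, y3 = level_means.to_list()

B0 = (y1 + y2 + y3) / 3      # grand mean of the level means
B1 = y1 - (y2 + y3) / 2      # level 1 vs. the mean of the remaining levels
B2 = y2 - y3                 # level 2 vs. level 3
print(B0, B1, B2)
```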

James-Stein Encoding

This encoding strategy takes a weight parameter \( B \). Each category is replaced with a numeric value defined in the following way: $$ i_{enc} = (1-B)\times \text{mean}(y_i) + B\times \text{mean}(y) $$

where \( \text{mean}(y_i) \) is the mean of the target for the given category and \( \text{mean}(y) \) is the overall target mean. An issue which may arise in this method is the choice of \( B \): it is a weight which adjusts how much significance is given to the overall target mean as opposed to the mean for the category in question. If \( B \) is too low, the encoding leans almost entirely on the per-category mean and the learning algorithm may overfit the data, whilst if \( B \) is too high, the representations tend to be too general, resulting in underfitting. In practice, when the estimate of the conditional mean is unreliable (high variance and/or a low number of examples), we put more weight on the global mean.
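A minimal sketch of the blended mean, assuming a fixed, hand-picked \( B \) and illustrative column names:

```python
import pandas as pd

df = pd.DataFrame({
    "Department": ["Sales", "Sales", "Research & Development",
                   "Research & Development", "Research & Development"],
    "Attrition":  [1, 0, 0, 1, 0],
})

B = 0.3                                   # weight on the global mean (a free choice)
global_mean = df["Attrition"].mean()
category_means = df.groupby("Department")["Attrition"].mean()

# (1 - B) * per-category mean + B * global mean
encoding = (1 - B) * category_means + B * global_mean
df["Department_js"] = df["Department"].map(encoding)
print(df)
```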

It is of note that this method was initially intended to work with normally distributed targets; a common "hack" for binary targets is to convert the per-category mean to the log-odds scale before blending.

Probability Ratio Encoding

Probability Ratio Encoding is similar to Weight of Evidence (WoE), the only difference being that the raw ratio of the "good" and "bad" probabilities is used rather than its logarithm. For each label, we calculate the mean of target = 1, that is, the probability of the target being 1 ( \( P(1) \) ), and also the probability of the target being 0 ( \( P(0) \) ). We then replace the label with the ratio \( P(1)/P(0) \). This is a direct improvement over frequency encoding, with the added benefit of capturing distributional information about the target value.

This can be extended to work with multiple categories by generalizing the target.
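A minimal sketch, assuming a binary target and with the probabilities clipped away from zero so the ratio is always defined:

```python
import pandas as pd

df = pd.DataFrame({
    "EducationField": ["Life Sciences", "Life Sciences", "Medical",
                       "Medical", "Medical", "Other"],
    "Attrition":      [1, 0, 1, 1, 0, 0],
})

# P(target = 1) per category, clipped away from 0/1 to avoid division by zero
p1 = df.groupby("EducationField")["Attrition"].mean().clip(1e-6, 1 - 1e-6)
p0 = 1 - p1

# Replace each label with the ratio P(1) / P(0); WoE would take the log of this
df["EducationField_ratio"] = df["EducationField"].map(p1 / p0)
print(df)
```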


The following is a collection of open-source libraries which implement the methods above in Python:

- Written by the original authors; implements min-hash and Gamma-Poisson encoding.
- Implements min-hash and similarity encoding.
- The most extensive of the three; implements weight-of-evidence, one-hot, polynomial and a host of other encoding strategies.


My socials:
%[twitter.com/aadiDev] %[linkedin.com/in/aadidev-sooknanan/]