Categorical preprocessors

Preprocessors for categorical features, covering use cases not handled by native scikit-learn transformers.

TargetEncoder

 TargetEncoder (task:str, handle_unknown='error', handle_missing='')

Encode categorical features according to the effect that each category has on the target variable.

Note that the implementation and docstrings are largely taken from Dirty Cat.

	Type	Default	Details
task	str		The type of problem. Either “classification” or “regression”.
handle_unknown	str	error	Either “error” or “ignore”. Whether to raise an error or ignore it if an unknown category is present during transform (default is to raise). If “ignore”, unknown categories will be set to the mean of the target.
handle_missing	str		Either “error” or "" (empty string). Whether to raise an error or impute with a blank string "" if missing values (NaN) are present during fit (default is to impute).

In general, TargetEncoder replaces each category with the mean of the target for that category, shrunk towards the overall mean of the target. The shrinkage follows an empirical Bayes approach, so estimates for rare categories are pulled closer to the overall mean.

It is particularly useful with high-cardinality categoricals: it does not expand the feature space as much as one-hot encoding, yet it retains more information than ordinal encoding.
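To make the feature-space argument concrete, the sketch below counts how many output columns each encoding would produce. The helper `encoded_width` and the `zip_code` column are hypothetical and only illustrate the arithmetic; no encoder is actually fitted.

```python
def encoded_width(columns, n_classes=2):
    """Count the output columns each encoding would produce.

    `columns` maps a feature name to its list of categorical values.
    """
    n_features = len(columns)
    n_levels = sum(len(set(values)) for values in columns.values())
    return {
        "one_hot": n_levels,    # one column per distinct level
        "ordinal": n_features,  # one integer column per feature
        # one column per feature, or one per feature and class if multiclass
        "target": n_features if n_classes <= 2 else n_features * n_classes,
    }


# Hypothetical data: a high-cardinality column with 1,000 distinct zip codes.
columns = {
    "zip_code": [f"{i:05d}" for i in range(1000)],
    "sex": ["female", "male", "other"] * 333 + ["female"],
}

widths = encoded_width(columns)
print(widths)
```

With a binary or regression target, the target-encoded width stays equal to the number of input features no matter how many levels each feature has.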

For more details, see Micci-Barreca, 2001: A preprocessing scheme for high-cardinality categorical attributes in classification and prediction problems.
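The empirical Bayes shrinkage mentioned above can be sketched as a weighted blend of the per-category target mean and the overall target mean. This is a minimal illustration in plain Python, not the library's code: the function name `shrunk_target_means` and the weighting `count / (count + m)` with `m = n_samples / n_categories` are assumptions, and the toy data is made up.

```python
from collections import defaultdict


def shrunk_target_means(categories, targets):
    """Blend each category's target mean with the overall target mean.

    Assumed weighting (not taken from the library source):
        lam = count / (count + m), with m = n_samples / n_categories
    so rare categories are pulled harder towards the overall mean.
    """
    n = len(targets)
    prior = sum(targets) / n  # overall mean of the target

    counts = defaultdict(int)
    sums = defaultdict(float)
    for cat, t in zip(categories, targets):
        counts[cat] += 1
        sums[cat] += t

    m = n / len(counts)  # hypothetical smoothing strength
    encodings = {}
    for cat, count in counts.items():
        lam = count / (count + m)
        encodings[cat] = lam * (sums[cat] / count) + (1 - lam) * prior
    return encodings, prior


# Toy data: 1 marks the positive class, 0 the negative class.
sex = ["a", "a", "b", "b", "c", "c", "c", "c", "c", "c"]
y = [1, 0, 0, 0, 1, 1, 1, 0, 0, 0]

encodings, prior = shrunk_target_means(sex, y)
for cat, value in sorted(encodings.items()):
    print(cat, round(value, 6))
```

Category "b" has no positive rows, yet its encoding is not 0: with only two observations, most of the weight goes to the overall mean.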

TargetEncoder.fit

 TargetEncoder.fit (X:Union[pandas.core.frame.DataFrame,numpy.ndarray,List],
                    y:Union[pandas.core.frame.DataFrame,numpy.ndarray,List])

Fit the TargetEncoder to X.

	Type	Details
X	Union[pd.DataFrame, np.ndarray, List]	The data to determine the categories of each feature.
y	Union[pd.DataFrame, np.ndarray, List]	The associated target vector.
Returns	TargetEncoder	Fitted TargetEncoder.

After fitting, the categories of each feature are held in the categories_ attribute.

TargetEncoder.transform

 TargetEncoder.transform (X:Union[pandas.core.frame.DataFrame,numpy.ndarray,List])

Transform X using specified encoding scheme.

	Type	Details
X	Union[pd.DataFrame, np.ndarray, List]	The data to encode.
Returns		Transformed input.
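The two `handle_unknown` options from the parameter table can be pictured as a lookup with a fallback at transform time. The sketch below is only an illustration under that reading: `encode_column`, the `encodings` dictionary and the `prior` value are hypothetical stand-ins for whatever state the fitted encoder keeps internally.

```python
def encode_column(values, encodings, prior, handle_unknown="error"):
    """Map categories to their encodings, with a fallback for unseen ones."""
    encoded = []
    for value in values:
        if value in encodings:
            encoded.append(encodings[value])
        elif handle_unknown == "ignore":
            # Unknown category: fall back to the mean of the target.
            encoded.append(prior)
        else:
            raise ValueError(f"Unknown category {value!r} found during transform")
    return encoded


# Hypothetical fitted state: per-category encodings and the target mean.
encodings = {"female": 0.25, "male": 0.5}
prior = 0.4

# "other" was never seen during fit, so it receives the target mean.
encoded = encode_column(["male", "other"], encodings, prior, handle_unknown="ignore")
print(encoded)
```

With the default `handle_unknown="error"`, the same call raises instead of silently imputing, which is the safer choice when unseen categories would indicate a data problem.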

TransformerMixin.fit_transform

 TransformerMixin.fit_transform (X, y=None, **fit_params)

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

	Type	Default	Details
X	array-like of shape (n_samples, n_features)		Input samples.
y	NoneType	None	Target values (None for unsupervised transformations).
fit_params	dict		Additional fit parameters.
Returns	ndarray of shape (n_samples, n_features_new)		Transformed array.

import pandas as pd
import numpy as np

rng = np.random.default_rng(0)

X = pd.DataFrame(
    {
        "sex": rng.choice(["female", "male", "other"], size=10),
        "status": rng.choice(
            ["employed", "unemployed", "retired", "inactive"], size=10
        ),
    }
)
y = rng.choice(["low", "high"], size=10)

encoder = TargetEncoder(task="classification", handle_unknown="ignore")
pd.DataFrame(encoder.fit_transform(X, y), columns=encoder.get_feature_names_out())

	sex	status
0	0.437500	0.416667
1	0.250000	0.375000
2	0.250000	0.416667
3	0.464286	0.416667
4	0.464286	0.375000
5	0.464286	0.416667
6	0.464286	0.416667
7	0.464286	0.416667
8	0.464286	0.416667
9	0.437500	0.375000

In the case of a multiclass target, the encodings are computed separately for each label, meaning that each feature is expanded into as many columns as there are unique levels in the target.

y = rng.choice(["low", "mid", "high"], size=10)

encoder = TargetEncoder(task="classification", handle_unknown="ignore")
pd.DataFrame(encoder.fit_transform(X, y), columns=encoder.get_feature_names_out())

	sex_high	sex_low	sex_mid	status_high	status_low	status_mid
0	0.312500	0.3125	0.375000	0.250	0.458333	0.291667
1	0.125000	0.6875	0.187500	0.125	0.562500	0.312500
2	0.125000	0.6875	0.187500	0.250	0.458333	0.291667
3	0.178571	0.5000	0.321429	0.250	0.458333	0.291667
4	0.178571	0.5000	0.321429	0.125	0.562500	0.312500
5	0.178571	0.5000	0.321429	0.250	0.458333	0.291667
6	0.178571	0.5000	0.321429	0.250	0.458333	0.291667
7	0.178571	0.5000	0.321429	0.250	0.458333	0.291667
8	0.178571	0.5000	0.321429	0.250	0.458333	0.291667
9	0.312500	0.3125	0.375000	0.125	0.562500	0.312500
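The multiclass expansion shown above can be sketched as one shrunk encoding per class, each treating its class as the positive label. This is a plain-Python illustration with made-up data, not the library's code: `multiclass_target_encode` is a hypothetical helper, and the weighting with `m = n_samples / n_categories` is an assumption. The `feature_class` naming and the sorted class order mirror the column names in the output above.

```python
from collections import Counter


def multiclass_target_encode(feature_name, categories, labels):
    """Return one encoded column per class, named '<feature>_<class>'."""
    n = len(labels)
    counts = Counter(categories)
    m = n / len(counts)  # hypothetical smoothing strength (assumption)

    columns = {}
    for cls in sorted(set(labels)):
        # Treat `cls` as the positive label and every other class as negative.
        prior = sum(label == cls for label in labels) / n
        hits = Counter(c for c, label in zip(categories, labels) if label == cls)
        # Shrunk mean: equivalent to lam * mean + (1 - lam) * prior,
        # with lam = count / (count + m).
        encoding = {
            cat: (hits[cat] + m * prior) / (count + m)
            for cat, count in counts.items()
        }
        columns[f"{feature_name}_{cls}"] = [encoding[c] for c in categories]
    return columns


# Toy data, made up for illustration.
status = ["x", "x", "y", "y", "y", "y"]
labels = ["low", "mid", "high", "high", "low", "mid"]

columns = multiclass_target_encode("status", status, labels)
for name, values in columns.items():
    print(name, [round(v, 4) for v in values])
```

Because every class uses the same weights, the per-class columns of a feature sum to 1 in each row, which also holds for the rows of the output table above.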