Categorical preprocessors

Covering use cases not handled by native scikit-learn transformers.

source

TargetEncoder

 TargetEncoder (task:str, handle_unknown='error', handle_missing='')

Encode categorical features, taking into account the effect they have on the target variable.

Note that the implementation and docstrings are largely taken from Dirty Cat.

| | Type | Default | Details |
|---|---|---|---|
| task | str | | The type of problem. Either "classification" or "regression". |
| handle_unknown | str | error | Either "error" or "ignore". Whether to raise an error or ignore if an unknown categorical feature is present during transform (default is to raise). If "ignore", unknown categories will be set to the mean of the target. |
| handle_missing | str | '' | Either "error" or "". Whether to raise an error or impute with the blank string "" if missing values (NaN) are present during fit (default is to impute). |
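For intuition, here is a minimal pandas sketch of the `handle_unknown="ignore"` behavior: categories unseen during fit fall back to the global mean of the target. The shrinkage step is omitted for clarity, and `encode_with_fallback` is a hypothetical helper, not part of this library.

```python
import pandas as pd

def encode_with_fallback(train_cats, y, test_cats):
    # Per-category target means learned at fit time.
    means = pd.Series(y).groupby(pd.Series(train_cats)).mean()
    # Global target mean used for categories never seen during fit.
    global_mean = pd.Series(y).mean()
    return [means.get(c, global_mean) for c in test_cats]

# "c" was not seen during fit, so it gets the global mean of y.
encode_with_fallback(["a", "b", "a"], [1.0, 0.0, 0.0], ["a", "c"])
```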

In general, TargetEncoder takes the ratio between the mean of the target for a given category and the global mean of the target. In addition, it takes an empirical Bayes approach to shrink the estimate.

It is particularly useful with high cardinality categoricals, as it will not expand the feature space as much as one hot encoding, but retains more information than ordinal encoding.

For more details, see Micci-Barreca, 2001: A preprocessing scheme for high-cardinality categorical attributes in classification and prediction problems.
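The shrinkage idea can be illustrated with a small pandas sketch. Here `shrunk_means` and `prior_weight` are hypothetical names chosen for illustration; the exact smoothing used by TargetEncoder follows the Micci-Barreca paper and may differ in detail.

```python
import pandas as pd

def shrunk_means(cats, y, prior_weight=10.0):
    # Empirical-Bayes-style shrinkage: blend each category's target mean
    # with the global target mean, weighting by the category's count.
    df = pd.DataFrame({"cat": cats, "y": y})
    global_mean = df["y"].mean()
    stats = df.groupby("cat")["y"].agg(["mean", "count"])
    lam = stats["count"] / (stats["count"] + prior_weight)  # shrinkage factor
    # Rare categories (small count, small lam) are pulled harder to the prior.
    return lam * stats["mean"] + (1 - lam) * global_mean

shrunk_means(["a", "a", "b"], [1.0, 0.0, 1.0])
```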


source

TargetEncoder.fit

 TargetEncoder.fit (X:Union[pandas.core.frame.DataFrame, numpy.ndarray, List], y:Union[pandas.core.frame.DataFrame, numpy.ndarray, List])

Fit the TargetEncoder to X.

| | Type | Details |
|---|---|---|
| X | Union[pd.DataFrame, np.ndarray, List] | The data to determine the categories of each feature. |
| y | Union[pd.DataFrame, np.ndarray, List] | The associated target vector. |
| Returns | TargetEncoder | Fitted TargetEncoder. |

After fitting, the categories of each feature are held in the categories_ attribute.


source

TargetEncoder.transform

 TargetEncoder.transform (X:Union[pandas.core.frame.DataFrame, numpy.ndarray, List])

Transform X using the specified encoding scheme.

| | Type | Details |
|---|---|---|
| X | Union[pd.DataFrame, np.ndarray, List] | The data to encode. |
| Returns | np.ndarray | Transformed input. |

TransformerMixin.fit_transform

 TransformerMixin.fit_transform (X, y=None, **fit_params)

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

| | Type | Default | Details |
|---|---|---|---|
| X | array-like of shape (n_samples, n_features) | | Input samples. |
| y | NoneType | None | Target values (None for unsupervised transformations). |
| fit_params | | | Additional fit parameters. |
| Returns | ndarray of shape (n_samples, n_features_new) | | Transformed array. |
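`fit_transform` simply chains `fit` and `transform`, which is why `fit` returns `self`. A toy transformer following the same protocol may make this concrete; `DemeanTransformer` is purely illustrative and not part of this library.

```python
import numpy as np

class DemeanTransformer:
    """Toy transformer illustrating the fit/transform protocol."""

    def fit(self, X, y=None, **fit_params):
        # Learn per-column means; return self so calls can be chained.
        self.mean_ = np.asarray(X, dtype=float).mean(axis=0)
        return self

    def transform(self, X):
        return np.asarray(X, dtype=float) - self.mean_

    def fit_transform(self, X, y=None, **fit_params):
        # What TransformerMixin provides: fit, then transform the same data.
        return self.fit(X, y, **fit_params).transform(X)

DemeanTransformer().fit_transform([[1.0, 2.0], [3.0, 4.0]])
```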
import pandas as pd
import numpy as np
rng = np.random.default_rng(0)

X = pd.DataFrame(
    {
        "sex": rng.choice(["female", "male", "other"], size=10),
        "status": rng.choice(
            ["employed", "unemployed", "retired", "inactive"], size=10
        ),
    }
)
y = rng.choice(["low", "high"], size=10)

encoder = TargetEncoder(task="classification", handle_unknown="ignore")
pd.DataFrame(encoder.fit_transform(X, y), columns=encoder.get_feature_names_out())
sex status
0 0.437500 0.416667
1 0.250000 0.375000
2 0.250000 0.416667
3 0.464286 0.416667
4 0.464286 0.375000
5 0.464286 0.416667
6 0.464286 0.416667
7 0.464286 0.416667
8 0.464286 0.416667
9 0.437500 0.375000

In the case of a multiclass target, the encodings are computed separately for each label, meaning that each feature is expanded into as many columns as there are unique levels in the target.

y = rng.choice(["low", "mid", "high"], size=10)

encoder = TargetEncoder(task="classification", handle_unknown="ignore")
pd.DataFrame(encoder.fit_transform(X, y), columns=encoder.get_feature_names_out())
sex_high sex_low sex_mid status_high status_low status_mid
0 0.312500 0.3125 0.375000 0.250 0.458333 0.291667
1 0.125000 0.6875 0.187500 0.125 0.562500 0.312500
2 0.125000 0.6875 0.187500 0.250 0.458333 0.291667
3 0.178571 0.5000 0.321429 0.250 0.458333 0.291667
4 0.178571 0.5000 0.321429 0.125 0.562500 0.312500
5 0.178571 0.5000 0.321429 0.250 0.458333 0.291667
6 0.178571 0.5000 0.321429 0.250 0.458333 0.291667
7 0.178571 0.5000 0.321429 0.250 0.458333 0.291667
8 0.178571 0.5000 0.321429 0.250 0.458333 0.291667
9 0.312500 0.3125 0.375000 0.125 0.562500 0.312500