import pandas as pd
import numpy as np
Categorical preprocessors
TargetEncoder
TargetEncoder (task:str, handle_unknown='error', handle_missing='')
Encode categorical features according to the effect each category has on the target variable.
Note that the implementation and docstrings are largely taken from Dirty Cat.
| | Type | Default | Details |
|---|---|---|---|
| task | str | | The type of problem. Either “classification” or “regression”. |
| handle_unknown | str | error | Either “error” or “ignore”. Whether to raise an error or ignore if an unknown categorical feature is present during transform (default is to raise). If “ignore”, unknown categories will be set to the mean of the target. |
| handle_missing | str | '' | Either “error” or “” (the empty string). Whether to raise an error or impute with a blank string if missing values (NaN) are present during fit (default is to impute). |
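The two policies can be pictured with a small stand-in. This is only a hypothetical sketch of the behaviour described in the table, not the library's code: the helpers `impute_missing` and `lookup`, and the encoding values 0.7, 0.3 and 0.5, are invented for illustration.

```python
import math


def impute_missing(values, handle_missing=""):
    """Fit-time step: raise on None/NaN, or replace them with ""."""
    cleaned = []
    for v in values:
        is_missing = v is None or (isinstance(v, float) and math.isnan(v))
        if is_missing and handle_missing == "error":
            raise ValueError("missing values found during fit")
        cleaned.append("" if is_missing else v)
    return cleaned


def lookup(values, encodings, target_mean, handle_unknown="error"):
    """Transform-time step: map each category to its fitted encoding."""
    out = []
    for v in values:
        if v in encodings:
            out.append(encodings[v])
        elif handle_unknown == "ignore":
            # Unknown category: fall back to the mean of the target.
            out.append(target_mean)
        else:
            raise ValueError(f"unknown category during transform: {v!r}")
    return out


# Made-up encodings for two categories seen during fit.
cleaned = impute_missing(["male", float("nan"), "female"])
encoded = lookup(
    ["male", "other"],
    {"female": 0.7, "male": 0.3},
    target_mean=0.5,
    handle_unknown="ignore",
)
```

Here `cleaned` keeps the missing entry as an empty-string category, and the unseen level "other" receives the target mean instead of raising.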
In general, TargetEncoder takes the ratio between the mean of the target for a given category and the mean of the target. In addition, it takes an empirical Bayes approach to shrink the estimate.
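To make the shrinkage idea concrete, here is a minimal sketch of one common form from the Micci-Barreca scheme: each category's target mean is blended with the global target mean, with a weight that grows with the category's sample count. The function name `shrunk_category_means` and the smoothing weight `m` are assumptions for this sketch; the exact weighting used inside TargetEncoder may differ.

```python
from collections import defaultdict


def shrunk_category_means(categories, target, m):
    """Blend each category's target mean with the global target mean.

    encoding = lam * mean(y | category) + (1 - lam) * mean(y)
    lam      = n_category / (n_category + m)

    `m` is a hypothetical smoothing weight chosen for this sketch.
    """
    sums, counts = defaultdict(float), defaultdict(int)
    for c, t in zip(categories, target):
        sums[c] += t
        counts[c] += 1
    prior = sum(target) / len(target)  # global mean of the target

    encodings = {}
    for c, n_c in counts.items():
        lam = n_c / (n_c + m)  # more samples -> trust the category mean more
        encodings[c] = lam * (sums[c] / n_c) + (1 - lam) * prior
    return encodings


# Toy binary target (1 = positive class).
enc = shrunk_category_means(
    ["a", "a", "b", "b", "b", "c"], [1, 0, 1, 1, 1, 0], m=2.0
)
```

With these toy inputs the rare category "c" (one sample, raw mean 0) is pulled strongly toward the global mean of 2/3, while "b" (three samples) stays closer to its raw mean of 1.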
It is particularly useful with high-cardinality categoricals, as it does not expand the feature space as much as one-hot encoding, yet retains more information than ordinal encoding.
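The difference in feature-space size can be seen with plain pandas. This is an illustrative sketch on synthetic data: the `zip_code` column and the binary target are made up, and the per-level target mean below is unshrunk, so it is a simplified stand-in for what TargetEncoder computes.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical high-cardinality column (up to 200 distinct levels).
zip_code = pd.Series(rng.integers(0, 200, size=1000).astype(str), name="zip_code")
y = pd.Series(rng.integers(0, 2, size=1000), name="y")

# One-hot encoding: one column per distinct level.
one_hot = pd.get_dummies(zip_code)

# Ordinal encoding: a single column of arbitrary integer codes.
ordinal = zip_code.astype("category").cat.codes

# Plain target-mean encoding: a single column carrying target information.
target_mean = y.groupby(zip_code).transform("mean")

n_levels = zip_code.nunique()
print(f"levels: {n_levels}, one-hot columns: {one_hot.shape[1]}")
```

One-hot encoding yields as many columns as there are levels, whereas the ordinal and target-mean encodings each stay at one column; only the latter relates the values to the target.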
For more details, see Micci-Barreca, 2001: A preprocessing scheme for high-cardinality categorical attributes in classification and prediction problems.
TargetEncoder.fit
TargetEncoder.fit (X:Union[pandas.core.frame.DataFrame,numpy.ndarray,List], y:Union[pandas.core.frame.DataFrame,numpy.ndarray,List])
Fit the TargetEncoder to X.
| | Type | Details |
|---|---|---|
| X | Union[pd.DataFrame, np.ndarray, List] | The data to determine the categories of each feature. |
| y | Union[pd.DataFrame, np.ndarray, List] | The associated target vector. |
| Returns | TargetEncoder | Fitted TargetEncoder. |
After fitting, the categories of each feature are held in the categories_ attribute.
TargetEncoder.transform
TargetEncoder.transform (X:Union[pandas.core.frame.DataFrame,numpy.ndarray,List])
Transform X using specified encoding scheme.
| | Type | Details |
|---|---|---|
| X | Union[pd.DataFrame, np.ndarray, List] | The input data to transform. |
TransformerMixin.fit_transform
TransformerMixin.fit_transform (X, y=None, **fit_params)
Fit to data, then transform it.
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
| | Type | Default | Details |
|---|---|---|---|
| X | array-like of shape (n_samples, n_features) | | Input samples. |
| y | NoneType | None | Target values (None for unsupervised transformations). |
| fit_params | | | |
| Returns | ndarray array of shape (n_samples, n_features_new) | | Transformed array. |
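As a generic illustration of what fit_transform provides, the sketch below defines a toy transformer on top of scikit-learn's TransformerMixin and checks that the one-step call matches calling fit and then transform. The class `MeanCenterer` is hypothetical and unrelated to TargetEncoder; it only shows the inherited behaviour.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class MeanCenterer(BaseEstimator, TransformerMixin):
    """Toy transformer: subtract the column means learned during fit."""

    def fit(self, X, y=None):
        self.mean_ = np.asarray(X, dtype=float).mean(axis=0)
        return self  # returning self allows fit(...).transform(...)

    def transform(self, X):
        return np.asarray(X, dtype=float) - self.mean_


X_demo = np.array([[1.0, 10.0], [3.0, 30.0]])

# fit_transform is inherited from TransformerMixin.
one_step = MeanCenterer().fit_transform(X_demo)
two_step = MeanCenterer().fit(X_demo).transform(X_demo)
```

Because only fit and transform are defined on the class, fit_transform comes entirely from the mixin, which is why it is documented under TransformerMixin rather than TargetEncoder.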
rng = np.random.default_rng(0)
X = pd.DataFrame(
    {
        "sex": rng.choice(["female", "male", "other"], size=10),
        "status": rng.choice(
            ["employed", "unemployed", "retired", "inactive"], size=10
        ),
    }
)
y = rng.choice(["low", "high"], size=10)

encoder = TargetEncoder(task="classification", handle_unknown="ignore")
pd.DataFrame(encoder.fit_transform(X, y), columns=encoder.get_feature_names_out())
| | sex | status |
|---|---|---|
| 0 | 0.437500 | 0.416667 |
| 1 | 0.250000 | 0.375000 |
| 2 | 0.250000 | 0.416667 |
| 3 | 0.464286 | 0.416667 |
| 4 | 0.464286 | 0.375000 |
| 5 | 0.464286 | 0.416667 |
| 6 | 0.464286 | 0.416667 |
| 7 | 0.464286 | 0.416667 |
| 8 | 0.464286 | 0.416667 |
| 9 | 0.437500 | 0.375000 |
In the case of a multiclass target, the encodings are computed separately for each label, meaning that each feature is expanded into as many columns as there are unique levels in the target.
y = rng.choice(["low", "mid", "high"], size=10)

encoder = TargetEncoder(task="classification", handle_unknown="ignore")
pd.DataFrame(encoder.fit_transform(X, y), columns=encoder.get_feature_names_out())
| | sex_high | sex_low | sex_mid | status_high | status_low | status_mid |
|---|---|---|---|---|---|---|
| 0 | 0.312500 | 0.3125 | 0.375000 | 0.250 | 0.458333 | 0.291667 |
| 1 | 0.125000 | 0.6875 | 0.187500 | 0.125 | 0.562500 | 0.312500 |
| 2 | 0.125000 | 0.6875 | 0.187500 | 0.250 | 0.458333 | 0.291667 |
| 3 | 0.178571 | 0.5000 | 0.321429 | 0.250 | 0.458333 | 0.291667 |
| 4 | 0.178571 | 0.5000 | 0.321429 | 0.125 | 0.562500 | 0.312500 |
| 5 | 0.178571 | 0.5000 | 0.321429 | 0.250 | 0.458333 | 0.291667 |
| 6 | 0.178571 | 0.5000 | 0.321429 | 0.250 | 0.458333 | 0.291667 |
| 7 | 0.178571 | 0.5000 | 0.321429 | 0.250 | 0.458333 | 0.291667 |
| 8 | 0.178571 | 0.5000 | 0.321429 | 0.250 | 0.458333 | 0.291667 |
| 9 | 0.312500 | 0.3125 | 0.375000 | 0.125 | 0.562500 | 0.312500 |
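The per-label expansion in the multiclass case can be sketched as a one-vs-rest loop: for each target label, binarize the target and encode the feature against that binary indicator. The function `one_vs_rest_target_encode`, the smoothing weight `m`, and the toy data are assumptions for illustration, not the library's implementation; only the `feature_label` column naming follows the output shown here.

```python
import pandas as pd


def one_vs_rest_target_encode(x, y, name, m=1.0):
    """Encode one categorical column against a multiclass target.

    For each label, the target is binarized (label vs. rest) and the
    per-category mean is blended with the global mean. `m` is a
    hypothetical smoothing weight for this sketch.
    """
    x = pd.Series(list(x))
    y = pd.Series(list(y))
    counts = x.map(x.value_counts())  # sample count of each row's category
    lam = counts / (counts + m)       # weight given to the category mean

    columns = {}
    for label in sorted(y.unique()):
        is_label = (y == label).astype(float)
        category_mean = is_label.groupby(x).transform("mean")
        columns[f"{name}_{label}"] = lam * category_mean + (1 - lam) * is_label.mean()
    return pd.DataFrame(columns)


sex = ["female", "female", "male", "male", "male", "other"]
level = ["low", "high", "mid", "mid", "low", "high"]
encoded = one_vs_rest_target_encode(sex, level, name="sex", m=2.0)
```

In this sketch the columns produced for one feature sum to 1 in every row, since each is a blend of class proportions; the rows of the table shown here have the same property.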