```python
import random

import pandas as pd
import numpy as np

from poniard import PoniardClassifier
```
Preprocessing
Introduction
Poniard tries to apply minimal preprocessing to data. In general, it just tries to make sure that models fit correctly without introducing significant transformation overhead. In particular, there is no anomaly detection, dimensionality reduction, clustering, resampling, feature creation from polynomial interactions, feature selection, etc.
This keeps things transparent: the user always knows what is being done to the data.
However, the default options may not be suitable for your data or objectives, so these can be set during initialization or modified afterwards.
Default preprocessing pipeline
The list of default transformations is:
- Missing data imputation.
- Z-score scaling for numeric variables.
- One-hot encoding for low cardinality categorical variables.
- Target encoding for the remaining categorical variables. This is a custom transformer based on Micci-Barreca, 2001, with implementation heavily based on Dirty Cat. If the task is multilabel or multioutput, ordinal encoding will be used instead.
- Datetime encoding for datetime variables. This also uses a custom transformer that extracts multiple datetime levels.
- Zero-variance feature elimination.
This includes some type inference logic that decides whether a given feature is numeric, high cardinality categorical, low cardinality categorical, or datetime (see Type inference below).
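Taken together, the defaults correspond roughly to the scikit-learn construction sketched below. This is for illustration only: Poniard assembles the real pipeline automatically from the inferred types, the column lists here are hypothetical, the datetime branch is omitted, and Poniard's custom target encoder is approximated with scikit-learn's `OrdinalEncoder` as a stand-in.

```python
from sklearn.compose import ColumnTransformer
from sklearn.feature_selection import VarianceThreshold
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler

# Hypothetical column lists; Poniard infers these from the data.
numeric_cols = ["age"]
categorical_low_cols = ["type"]
categorical_high_cols = ["rating"]

numeric_preprocessor = Pipeline(
    [("numeric_imputer", SimpleImputer()), ("scaler", StandardScaler())]
)
categorical_low_preprocessor = Pipeline(
    [
        ("categorical_imputer", SimpleImputer(strategy="most_frequent")),
        # sparse=False matches the scikit-learn version used in these docs;
        # newer releases take sparse_output=False instead.
        (
            "one-hot_encoder",
            OneHotEncoder(drop="if_binary", handle_unknown="ignore", sparse=False),
        ),
    ]
)
categorical_high_preprocessor = Pipeline(
    [
        ("categorical_imputer", SimpleImputer(strategy="most_frequent")),
        # Stand-in for Poniard's custom target encoder.
        (
            "high_cardinality_encoder",
            OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1),
        ),
    ]
)
sketch = Pipeline(
    [
        (
            "type_preprocessor",
            ColumnTransformer(
                [
                    ("numeric_preprocessor", numeric_preprocessor, numeric_cols),
                    ("categorical_low_preprocessor", categorical_low_preprocessor, categorical_low_cols),
                    ("categorical_high_preprocessor", categorical_high_preprocessor, categorical_high_cols),
                ]
            ),
        ),
        # Zero-variance feature elimination.
        ("remove_invariant", VarianceThreshold()),
    ]
)
```

The cell below builds a small synthetic dataset and shows the pipeline Poniard actually constructs: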
```python
random.seed(0)
rng = np.random.default_rng(0)
data = pd.DataFrame(
    {
        "type": random.choices(["house", "apartment"], k=500),
        "age": rng.uniform(1, 200, 500).astype(int),
        "date": pd.date_range("2022-01-01", freq="M", periods=500),
        "rating": random.choices(range(50), k=500),
        "target": random.choices([0, 1], k=500),
    }
)
X, y = data.drop("target", axis=1), data["target"]
pnd = PoniardClassifier().setup(X, y)
pnd.preprocessor
```
Setup info

Target
Type: binary
Shape: (500,)
Unique values: 2

Metrics
Main metric: roc_auc

Feature type inference
Minimum unique values to consider a number-like feature numeric: 50
Minimum unique values to consider a categorical feature high cardinality: 20

Inferred feature types:

| | numeric | categorical_high | categorical_low | datetime |
|---|---|---|---|---|
| 0 | age | rating | type | date |
```
Pipeline(steps=[('type_preprocessor', ColumnTransformer(transformers=[('numeric_preprocessor', Pipeline(steps=[('numeric_imputer', SimpleImputer()), ('scaler', StandardScaler())]), ['age']), ('categorical_low_preprocessor', Pipeline(steps=[('categorical_imputer', SimpleImputer(strategy='most_frequent')), ('one-hot_encoder', OneHotEncoder(drop='if_binary', hand... SimpleImputer(strategy='most_frequent')), ('high_cardinality_encoder', TargetEncoder(handle_unknown='ignore', task='classification'))]), ['rating']), ('datetime_preprocessor', Pipeline(steps=[('datetime_encoder', DatetimeEncoder()), ('datetime_imputer', SimpleImputer(strategy='most_frequent'))]), ['date'])])), ('remove_invariant', VarianceThreshold())], verbose=0)
```
Type inference
Type inference is governed by the input data types and two thresholds.

Number features (as defined by numpy) with unique values greater than `numeric_threshold` will be treated as numeric, with the remainder being treated as non-numeric. If this parameter is a float, the actual threshold is `numeric_threshold * samples`.

Non-numeric features (either because they are number features below `numeric_threshold` or they are non-number features like strings) with unique values greater than `cardinality_threshold` will be considered high cardinality. Likewise, in the case of a float value, the threshold is `cardinality_threshold * samples`.

These thresholds are part of `PoniardPreprocessor`, which is the default preprocessor used in `PoniardBaseEstimator`.
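As a rough sketch of these rules (not Poniard's actual implementation; the function name and the pandas dtype checks are illustrative), the logic could be written as follows. The defaults shown, `numeric_threshold=0.1` and `cardinality_threshold=20`, are consistent with the setup output in this document:

```python
import pandas as pd


def infer_feature_type(
    column: pd.Series, numeric_threshold=0.1, cardinality_threshold=20
) -> str:
    """Illustrative sketch of the type inference rules described above."""
    n_samples = len(column)
    # Float thresholds are interpreted as a fraction of the number of samples.
    if isinstance(numeric_threshold, float):
        numeric_threshold = numeric_threshold * n_samples
    if isinstance(cardinality_threshold, float):
        cardinality_threshold = cardinality_threshold * n_samples
    if pd.api.types.is_datetime64_any_dtype(column):
        return "datetime"
    unique_values = column.nunique()
    # Number features with enough unique values are treated as numeric...
    if pd.api.types.is_numeric_dtype(column) and unique_values > numeric_threshold:
        return "numeric"
    # ...the rest are categorical, split by cardinality.
    if unique_values > cardinality_threshold:
        return "categorical_high"
    return "categorical_low"
```

With 500 samples, for instance, the numeric cutoff resolves to `0.1 * 500 = 50`, matching the setup output above.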
Defaults are set at reasonable limits, but do pay attention to the output of `PoniardBaseEstimator.setup` as it might expose misclassified features. In that scenario there are three options:

- Pass a `PoniardPreprocessor` with different thresholds that better accommodate the dataset to `custom_preprocessor` (see the sketch after this list).
- Pass a scikit-learn transformer (including pipelines) to `custom_preprocessor` that applies appropriate transformations to different sets of features.
- Use the `PoniardBaseEstimator.reassign_types` method to explicitly assign features to the available categories.
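For instance, the first option might look like this (a sketch: the threshold values are arbitrary, and the parameter names are the `numeric_threshold` and `cardinality_threshold` discussed above):

```python
from poniard import PoniardClassifier
from poniard.preprocessing import PoniardPreprocessor

# Treat number-like features with more than 20 unique values as numeric, and
# flag categoricals as high cardinality only above 50 unique values.
preprocessor = PoniardPreprocessor(numeric_threshold=20, cardinality_threshold=50)
pnd = PoniardClassifier(custom_preprocessor=preprocessor)
```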
In the following example, `PoniardBaseEstimator.reassign_types` (the third option) is used to make every feature numeric as far as preprocessing goes.
```python
from sklearn.datasets import fetch_california_housing

from poniard import PoniardRegressor

X, y = fetch_california_housing(return_X_y=True, as_frame=True)
reg = PoniardRegressor()
reg.setup(X, y)
reg.preprocessor
```
Setup info

Target
Type: continuous
Shape: (20640,)
Unique values: 3842

Metrics
Main metric: neg_mean_squared_error

Feature type inference
Minimum unique values to consider a number-like feature numeric: 2064
Minimum unique values to consider a categorical feature high cardinality: 20

Inferred feature types:

| | numeric | categorical_high | categorical_low | datetime |
|---|---|---|---|---|
| 0 | MedInc | HouseAge | | |
| 1 | AveRooms | Latitude | | |
| 2 | AveBedrms | Longitude | | |
| 3 | Population | | | |
| 4 | AveOccup | | | |
```
Pipeline(steps=[('type_preprocessor', ColumnTransformer(transformers=[('numeric_preprocessor', Pipeline(steps=[('numeric_imputer', SimpleImputer()), ('scaler', StandardScaler())]), ['MedInc', 'AveRooms', 'AveBedrms', 'Population', 'AveOccup']), ('categorical_high_preprocessor', Pipeline(steps=[('categorical_imputer', SimpleImputer(strategy='most_frequent')), ('high_cardinality_encoder', TargetEncoder(handle_unknown='ignore', task='regression'))]), ['HouseAge', 'Latitude', 'Longitude'])])), ('remove_invariant', VarianceThreshold())], verbose=0)
```
```python
reg.reassign_types(
    numeric=[
        "HouseAge",
        "Latitude",
        "Longitude",
    ]
)
reg.preprocessor
```
Assigned feature types:

| | numeric | categorical_high | categorical_low | datetime |
|---|---|---|---|---|
| 0 | MedInc | | | |
| 1 | AveRooms | | | |
| 2 | AveBedrms | | | |
| 3 | Population | | | |
| 4 | AveOccup | | | |
| 5 | HouseAge | | | |
| 6 | Latitude | | | |
| 7 | Longitude | | | |
```
Pipeline(steps=[('type_preprocessor', Pipeline(steps=[('numeric_imputer', SimpleImputer()), ('scaler', StandardScaler())])), ('remove_invariant', VarianceThreshold())])
```
Modifying the default preprocessor during construction
Combining properly set up feature types with the `scaler`, `numeric_imputer` and `high_cardinality_encoder` parameters in `PoniardPreprocessor` allows almost complete customization of the default preprocessing pipeline.

These three parameters take strings representing transformers (for example, `scaler="minmax"` will use scikit-learn's `MinMaxScaler`; see the reference), and also accept scikit-learn transformers and pipelines.

For now, we are deliberately not providing options for the categorical imputer (a `SimpleImputer(strategy="most_frequent")` is used) or the low cardinality categorical encoder (always `OneHotEncoder(drop="if_binary", handle_unknown="ignore", sparse=False)`). While this is not set in stone, we feel that these defaults are less debatable.
```python
from sklearn.datasets import fetch_california_housing
from sklearn.impute import KNNImputer

from poniard import PoniardRegressor
from poniard.preprocessing import PoniardPreprocessor

preprocessor = PoniardPreprocessor(numeric_imputer=KNNImputer(), scaler="robust")
X, y = fetch_california_housing(return_X_y=True, as_frame=True)
reg = PoniardRegressor(custom_preprocessor=preprocessor)
reg.setup(X, y)
reg.reassign_types(
    numeric=[
        "AveRooms",
        "AveBedrms",
        "Population",
        "AveOccup",
        "Latitude",
        "Longitude",
    ],
    categorical_high=["HouseAge"],
)
reg.preprocessor
```
Setup info

Target
Type: continuous
Shape: (20640,)
Unique values: 3842

Metrics
Main metric: neg_mean_squared_error

Feature type inference
Minimum unique values to consider a number-like feature numeric: 2064
Minimum unique values to consider a categorical feature high cardinality: 20

Inferred feature types:

| | numeric | categorical_high | categorical_low | datetime |
|---|---|---|---|---|
| 0 | MedInc | HouseAge | | |
| 1 | AveRooms | Latitude | | |
| 2 | AveBedrms | Longitude | | |
| 3 | Population | | | |
| 4 | AveOccup | | | |

Assigned feature types:

| | numeric | categorical_high | categorical_low | datetime |
|---|---|---|---|---|
| 0 | MedInc | HouseAge | | |
| 1 | AveRooms | | | |
| 2 | AveBedrms | | | |
| 3 | Population | | | |
| 4 | AveOccup | | | |
| 5 | Latitude | | | |
| 6 | Longitude | | | |
```
Pipeline(steps=[('type_preprocessor', ColumnTransformer(transformers=[('numeric_preprocessor', Pipeline(steps=[('numeric_imputer', KNNImputer()), ('scaler', RobustScaler())]), ['MedInc', 'AveRooms', 'AveBedrms', 'Population', 'AveOccup', 'Latitude', 'Longitude']), ('categorical_high_preprocessor', Pipeline(steps=[('categorical_imputer', SimpleImputer(strategy='most_frequent')), ('high_cardinality_encoder', TargetEncoder(handle_unknown='ignore', task='regression'))]), ['HouseAge'])])), ('remove_invariant', VarianceThreshold())])
```
Modifying the default preprocessor after construction
Transformers and pipelines can be added to an existing preprocessor in any position with `PoniardBaseEstimator.add_preprocessing_step`.
```python
from sklearn.feature_selection import SelectKBest, f_regression

reg.add_preprocessing_step(
    ("feature_selection", SelectKBest(f_regression, k=5)), position="end"
)
reg.preprocessor
```
```
Pipeline(steps=[('type_preprocessor', ColumnTransformer(transformers=[('numeric_preprocessor', Pipeline(steps=[('numeric_imputer', KNNImputer()), ('scaler', RobustScaler())]), ['MedInc', 'AveRooms', 'AveBedrms', 'Population', 'AveOccup', 'Latitude', 'Longitude']), ('categorical_high_preprocessor', Pipeline(steps=[('categorical_imputer', SimpleImputer(strategy='most_frequent')), ('high_cardinality_encoder', TargetEncoder(handle_unknown='ignore', task='regression'))]), ['HouseAge'])])), ('remove_invariant', VarianceThreshold()), ('feature_selection', SelectKBest(k=5, score_func=<function f_regression at 0x17c88bf70>))])
```
Use a custom sklearn preprocessor
During init of either `PoniardRegressor` or `PoniardClassifier` (see docs for `PoniardBaseEstimator`, which sets up most of the functionality), `preprocess=False` disables preprocessing altogether, while `custom_preprocessor` accepts a scikit-learn transformer (or pipeline/column transformer) that replaces the default Poniard transformation pipeline.

Naturally, there is no type inference involved when these options are used, and full control is given to the user.

In the following example, we use `TfidfVectorizer` and `Normalizer` to process the 20 Newsgroups dataset.
```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import Normalizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

from poniard import PoniardClassifier

X, y = fetch_20newsgroups(
    return_X_y=True,
    remove=("headers", "footers", "quotes"),
    categories=("sci.crypt", "sci.electronics", "sci.med"),
)
preprocessor = make_pipeline(TfidfVectorizer(), Normalizer())
pnd = PoniardClassifier(
    estimators=[LogisticRegression()], custom_preprocessor=preprocessor
)
pnd.setup(X, y)
pnd.preprocessor
```
Setup info

Target
Type: multiclass
Shape: (1780,)
Unique values: 3

Metrics
Main metric: roc_auc_ovr
```
Pipeline(steps=[('tfidfvectorizer', TfidfVectorizer()), ('normalizer', Normalizer())], verbose=0)
```
```python
pnd.fit()
pnd.get_results()
```