Preprocessing

Learn how Poniard preprocessors can be modified to fit different use cases and datasets

Introduction

Poniard tries to apply minimal preprocessing to data. In general, it just makes sure that models fit correctly without introducing significant transformation overhead. In particular, there is no anomaly detection, dimensionality reduction, clustering, resampling, feature creation from polynomial interactions, feature selection, etc.

This way, the user always knows exactly what is being done to the data.

However, the default options may not be suitable for your data or objectives, so these can be set during initialization or modified afterwards.

Default preprocessing pipeline

The list of default transformations is:

  • Missing data imputation.
  • Z-score scaling for numeric variables.
  • One-hot encoding for low cardinality categorical variables.
  • Target encoding for the remaining categorical variables. This is a custom transformer based on Micci-Barreca (2001), with an implementation heavily based on Dirty Cat. If the task is multilabel or multioutput, ordinal encoding is used instead.
  • Datetime encoding for datetime variables. This also uses a custom transformer that extracts multiple datetime levels.
  • Zero-variance feature elimination.

This includes type inference logic that decides whether a given feature is numeric, high-cardinality categorical, low-cardinality categorical, or datetime (see Type inference below).

import random

import pandas as pd
import numpy as np
from poniard import PoniardClassifier
random.seed(0)
rng = np.random.default_rng(0)

# Toy dataset with one feature of each inferred type: low-cardinality
# categorical, numeric, datetime and high-cardinality categorical.
data = pd.DataFrame(
    {
        "type": random.choices(["house", "apartment"], k=500),
        "age": rng.uniform(1, 200, 500).astype(int),
        "date": pd.date_range("2022-01-01", freq="M", periods=500),
        "rating": random.choices(range(50), k=500),
        "target": random.choices([0, 1], k=500),
    }
)
X, y = data.drop("target", axis=1), data["target"]
pnd = PoniardClassifier().setup(X, y)
pnd.preprocessor

Setup info

Target

Type: binary

Shape: (500,)

Unique values: 2

Metrics

Main metric: roc_auc

Feature type inference

Minimum unique values to consider a number-like feature numeric: 50

Minimum unique values to consider a categorical feature high cardinality: 20

Inferred feature types:

   numeric  categorical_high  categorical_low  datetime
0  age      rating            type             date
Pipeline(steps=[('type_preprocessor',
                 ColumnTransformer(transformers=[('numeric_preprocessor',
                                                  Pipeline(steps=[('numeric_imputer',
                                                                   SimpleImputer()),
                                                                  ('scaler',
                                                                   StandardScaler())]),
                                                  ['age']),
                                                 ('categorical_low_preprocessor',
                                                  Pipeline(steps=[('categorical_imputer',
                                                                   SimpleImputer(strategy='most_frequent')),
                                                                  ('one-hot_encoder',
                                                                   OneHotEncoder(drop='if_binary',
                                                                                 hand...
                                                                   SimpleImputer(strategy='most_frequent')),
                                                                  ('high_cardinality_encoder',
                                                                   TargetEncoder(handle_unknown='ignore',
                                                                                 task='classification'))]),
                                                  ['rating']),
                                                 ('datetime_preprocessor',
                                                  Pipeline(steps=[('datetime_encoder',
                                                                   DatetimeEncoder()),
                                                                  ('datetime_imputer',
                                                                   SimpleImputer(strategy='most_frequent'))]),
                                                  ['date'])])),
                ('remove_invariant', VarianceThreshold())],
         verbose=0)
Empty subpreprocessors

If no features are assigned to a subpreprocessor (like datetime_preprocessor or categorical_low_preprocessor), it will be dropped from the pipeline. This does not affect results, as scikit-learn effectively ignores transformers with no assigned features, but it makes the HTML representation cleaner.
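
For example, dropping the datetime column from the dataset above should produce a preprocessor with no datetime_preprocessor step (a sketch reusing the previous X and y):

pnd = PoniardClassifier().setup(X.drop("date", axis=1), y)
# The ColumnTransformer no longer includes a 'datetime_preprocessor' entry.
pnd.preprocessor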

Type inference

Type inference is governed by the input data types and two thresholds.

Number features (as defined by numpy) with more unique values than numeric_threshold are treated as numeric; the remainder are treated as non-numeric. If this parameter is a float, the actual threshold is numeric_threshold multiplied by the number of samples.

Non-numeric features (either number features below numeric_threshold or non-number features such as strings) with more unique values than cardinality_threshold are considered high cardinality. Likewise, if this parameter is a float, the threshold is cardinality_threshold multiplied by the number of samples.

These thresholds are part of PoniardPreprocessor, which is the default preprocessor used in PoniardBaseEstimator.
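
For illustration, the thresholds can be set explicitly when constructing the preprocessor (a sketch; the values below mirror the defaults implied by the setup printouts on this page):

from poniard.preprocessing import PoniardPreprocessor

# A float numeric_threshold scales with dataset size:
#   500 samples   -> 0.1 * 500   = 50 unique values
#   20640 samples -> 0.1 * 20640 = 2064 unique values
# An integer cardinality_threshold is an absolute unique-value count.
prep = PoniardPreprocessor(numeric_threshold=0.1, cardinality_threshold=20)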

Defaults are set at reasonable limits, but do pay attention to the output of PoniardBaseEstimator.setup as it might expose misclassified features. In that scenario, there are three options:

  1. pass a PoniardPreprocessor with different thresholds that better accommodate the dataset to custom_preprocessor (see the sketch after this list).
  2. pass a scikit-learn transformer (including pipelines) to custom_preprocessor that applies appropriate transformations to different sets of features.
  3. use the PoniardBaseEstimator.reassign_types method to explicitly assign features to the available type categories.
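
A minimal sketch of option 1, assuming arbitrary threshold values:

from poniard import PoniardRegressor
from poniard.preprocessing import PoniardPreprocessor

# Number-like features now need more than 100 unique values to be treated
# as numeric, and non-numeric features with more than 10 unique values are
# considered high cardinality.
custom_prep = PoniardPreprocessor(numeric_threshold=100, cardinality_threshold=10)
reg = PoniardRegressor(custom_preprocessor=custom_prep)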

In the following example, PoniardBaseEstimator.reassign_types is used to treat every feature as numeric for preprocessing purposes.

from sklearn.datasets import fetch_california_housing
from poniard import PoniardRegressor
X, y = fetch_california_housing(return_X_y=True, as_frame=True)
reg = PoniardRegressor()
reg.setup(X, y)
reg.preprocessor

Setup info

Target

Type: continuous

Shape: (20640,)

Unique values: 3842

Metrics

Main metric: neg_mean_squared_error

Feature type inference

Minimum unique values to consider a number-like feature numeric: 2064

Minimum unique values to consider a categorical feature high cardinality: 20

Inferred feature types:

   numeric     categorical_high  categorical_low  datetime
0  MedInc      HouseAge
1  AveRooms    Latitude
2  AveBedrms   Longitude
3  Population
4  AveOccup
Pipeline(steps=[('type_preprocessor',
                 ColumnTransformer(transformers=[('numeric_preprocessor',
                                                  Pipeline(steps=[('numeric_imputer',
                                                                   SimpleImputer()),
                                                                  ('scaler',
                                                                   StandardScaler())]),
                                                  ['MedInc', 'AveRooms',
                                                   'AveBedrms', 'Population',
                                                   'AveOccup']),
                                                 ('categorical_high_preprocessor',
                                                  Pipeline(steps=[('categorical_imputer',
                                                                   SimpleImputer(strategy='most_frequent')),
                                                                  ('high_cardinality_encoder',
                                                                   TargetEncoder(handle_unknown='ignore',
                                                                                 task='regression'))]),
                                                  ['HouseAge', 'Latitude',
                                                   'Longitude'])])),
                ('remove_invariant', VarianceThreshold())],
         verbose=0)
reg.reassign_types(
    numeric=[
        "HouseAge",
        "Latitude",
        "Longitude",
    ]
)
reg.preprocessor

Assigned feature types:

   numeric     categorical_high  categorical_low  datetime
0  MedInc
1  AveRooms
2  AveBedrms
3  Population
4  AveOccup
5  HouseAge
6  Latitude
7  Longitude
Pipeline(steps=[('type_preprocessor',
                 Pipeline(steps=[('numeric_imputer', SimpleImputer()),
                                 ('scaler', StandardScaler())])),
                ('remove_invariant', VarianceThreshold())])

Modifying the default preprocessor during construction

Combining properly set up feature types with the scaler, numeric_imputer and high_cardinality_encoder parameters of PoniardPreprocessor allows almost complete customization of the default preprocessing pipeline.

These three parameters take strings representing transformers (for example, scaler="minmax" will use scikit-learn's MinMaxScaler; see the reference), and also accept scikit-learn transformers and pipelines.
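
Strings and estimator instances can be mixed freely. A sketch (the OrdinalEncoder arguments are only an illustration of passing a ready-made scikit-learn transformer):

from sklearn.preprocessing import OrdinalEncoder
from poniard.preprocessing import PoniardPreprocessor

prep = PoniardPreprocessor(
    scaler="minmax",  # string shorthand for MinMaxScaler
    high_cardinality_encoder=OrdinalEncoder(
        handle_unknown="use_encoded_value", unknown_value=-1
    ),
)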

For now, we deliberately do not provide options for the categorical imputer (a SimpleImputer(strategy="most_frequent") is always used) or the low cardinality categorical encoder (always OneHotEncoder(drop="if_binary", handle_unknown="ignore", sparse=False)). While this is not set in stone, we feel these defaults are less debatable.

from sklearn.datasets import fetch_california_housing
from sklearn.impute import KNNImputer
from poniard import PoniardRegressor
from poniard.preprocessing import PoniardPreprocessor
preprocessor = PoniardPreprocessor(numeric_imputer=KNNImputer(), scaler="robust")
X, y = fetch_california_housing(return_X_y=True, as_frame=True)
reg = PoniardRegressor(custom_preprocessor=preprocessor)
reg.setup(X, y)
reg.reassign_types(
    numeric=[
        "AveRooms",
        "AveBedrms",
        "Population",
        "AveOccup",
        "Latitude",
        "Longitude",
    ],
    categorical_high=["HouseAge"],
)
reg.preprocessor

Setup info

Target

Type: continuous

Shape: (20640,)

Unique values: 3842

Metrics

Main metric: neg_mean_squared_error

Feature type inference

Minimum unique values to consider a number-like feature numeric: 2064

Minimum unique values to consider a categorical feature high cardinality: 20

Inferred feature types:

   numeric     categorical_high  categorical_low  datetime
0  MedInc      HouseAge
1  AveRooms    Latitude
2  AveBedrms   Longitude
3  Population
4  AveOccup

Assigned feature types:

   numeric     categorical_high  categorical_low  datetime
0  MedInc      HouseAge
1  AveRooms
2  AveBedrms
3  Population
4  AveOccup
5  Latitude
6  Longitude
Pipeline(steps=[('type_preprocessor',
                 ColumnTransformer(transformers=[('numeric_preprocessor',
                                                  Pipeline(steps=[('numeric_imputer',
                                                                   KNNImputer()),
                                                                  ('scaler',
                                                                   RobustScaler())]),
                                                  ['MedInc', 'AveRooms',
                                                   'AveBedrms', 'Population',
                                                   'AveOccup', 'Latitude',
                                                   'Longitude']),
                                                 ('categorical_high_preprocessor',
                                                  Pipeline(steps=[('categorical_imputer',
                                                                   SimpleImputer(strategy='most_frequent')),
                                                                  ('high_cardinality_encoder',
                                                                   TargetEncoder(handle_unknown='ignore',
                                                                                 task='regression'))]),
                                                  ['HouseAge'])])),
                ('remove_invariant', VarianceThreshold())])

Modifying the default preprocessor after construction

Transformers and pipelines can be added to an existing preprocessor at any position with PoniardBaseEstimator.add_preprocessing_step.

from sklearn.feature_selection import SelectKBest, f_regression
reg.add_preprocessing_step(
    ("feature_selection", SelectKBest(f_regression, k=5)), position="end"
)
reg.preprocessor
Pipeline(steps=[('type_preprocessor',
                 ColumnTransformer(transformers=[('numeric_preprocessor',
                                                  Pipeline(steps=[('numeric_imputer',
                                                                   KNNImputer()),
                                                                  ('scaler',
                                                                   RobustScaler())]),
                                                  ['MedInc', 'AveRooms',
                                                   'AveBedrms', 'Population',
                                                   'AveOccup', 'Latitude',
                                                   'Longitude']),
                                                 ('categorical_high_preprocessor',
                                                  Pipeline(steps=[('categorical_imputer',
                                                                   SimpleImputer(strategy='most_frequent')),
                                                                  ('high_cardinality_encoder',
                                                                   TargetEncoder(handle_unknown='ignore',
                                                                                 task='regression'))]),
                                                  ['HouseAge'])])),
                ('remove_invariant', VarianceThreshold()),
                ('feature_selection',
                 SelectKBest(k=5,
                             score_func=<function f_regression at 0x17c88bf70>))])

Use a custom sklearn preprocessor

During initialization of either PoniardRegressor or PoniardClassifier (see the docs for PoniardBaseEstimator, which implements most of the shared functionality), preprocess=False disables preprocessing altogether, while custom_preprocessor accepts a scikit-learn transformer (or pipeline/column transformer) that replaces the default Poniard preprocessing pipeline.

Naturally, there is no type inference involved when either of these options is used; full control is given to the user.
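
Disabling preprocessing entirely is a one-liner (a sketch; the data must already be model-ready):

from poniard import PoniardRegressor

# No imputation, scaling or encoding will be applied.
reg = PoniardRegressor(preprocess=False)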

In the following example, we use TfidfVectorizer and Normalizer to process the 20 Newsgroups dataset.

from sklearn.datasets import fetch_20newsgroups
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import Normalizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

from poniard import PoniardClassifier
X, y = fetch_20newsgroups(
    return_X_y=True,
    remove=("headers", "footers", "quotes"),
    categories=("sci.crypt", "sci.electronics", "sci.med"),
)
preprocessor = make_pipeline(TfidfVectorizer(), Normalizer())
pnd = PoniardClassifier(
    estimators=[LogisticRegression()], custom_preprocessor=preprocessor
)
pnd.setup(X, y)
pnd.preprocessor

Setup info

Target

Type: multiclass

Shape: (1780,)

Unique values: 3

Metrics

Main metric: roc_auc_ovr
Pipeline(steps=[('tfidfvectorizer', TfidfVectorizer()),
                ('normalizer', Normalizer())],
         verbose=0)
# Fit all estimators and retrieve the comparison table of metrics.
pnd.fit()
pnd.get_results()