Poniard Preprocessor

Preprocessing data based on input types.

source

PoniardPreprocessor

 PoniardPreprocessor (task:Optional[str]=None,
                      scaler:Optional[Union[str,TransformerMixin]]=None,
                      high_cardinality_encoder:Optional[Union[str,TransformerMixin]]=None,
                      numeric_imputer:Optional[Union[str,TransformerMixin]]=None,
                      custom_preprocessor:Union[None,Pipeline,TransformerMixin]=None,
                      numeric_threshold:Union[int,float]=0.1,
                      cardinality_threshold:Union[int,float]=20,
                      verbose:int=0, random_state:Optional[int]=None,
                      n_jobs:Optional[int]=None,
                      cache_transformations:bool=False)

Base preprocessor that builds an easily modifiable pipeline based on feature data types.

|  | Type | Default | Details |
|---|---|---|---|
| task | Optional[str] | None | Either "classification" or "regression". |
| scaler | Optional[Union[str, TransformerMixin]] | None | Numeric scaler method. Either "standard", "minmax", "robust" or a scikit-learn Transformer. |
| high_cardinality_encoder | Optional[Union[str, TransformerMixin]] | None | Encoder for categorical features with high cardinality. Either "target" or "ordinal", or a scikit-learn Transformer. |
| numeric_imputer | Optional[Union[str, TransformerMixin]] | None | Imputation method. Either "simple", "iterative" or a scikit-learn Transformer. |
| custom_preprocessor | Union[None, Pipeline, TransformerMixin] | None |  |
| numeric_threshold | Union[int, float] | 0.1 | Number-like features with unique values above this threshold will be treated as numeric. If float, the threshold is numeric_threshold * samples. |
| cardinality_threshold | Union[int, float] | 20 | Non-numeric features with cardinality above this threshold will be encoded with the high_cardinality_encoder instead of one-hot encoded. If float, the threshold is cardinality_threshold * samples. |
| verbose | int | 0 | Verbosity level. Propagated to every scikit-learn function and estimator. |
| random_state | Optional[int] | None | RNG seed. Propagated to every scikit-learn function and estimator. The default None sets random_state to 0 so that cross_validate results are comparable. |
| n_jobs | Optional[int] | None | Controls parallel processing. -1 uses all cores. Propagated to every scikit-learn function. |
| cache_transformations | bool | False | Whether to cache transformations and set the memory parameter for Pipelines. This can speed up slow transformations, as they are not recalculated for each estimator. |

PoniardPreprocessor’s job is to build a preprocessing pipeline that fits the input data, both features and target. It does this by inferring the types of the features and selecting an appropriate family of transformers for each group. The user is free to select which particular transformer to use for each group, for example by changing the default numeric scaler from StandardScaler to RobustScaler.

Customization is done through 3 parameters related to transformers (scaler, high_cardinality_encoder and numeric_imputer), which take either a keyword string or a standard sklearn-compatible transformer, and 2 parameters related to type inference (numeric_threshold and cardinality_threshold).
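The string-or-transformer pattern shared by the three transformer parameters can be sketched roughly as follows. This is a hypothetical helper (the SCALERS mapping and resolve_scaler name are illustrative, not Poniard's actual code), shown only to clarify how a keyword would resolve to a scikit-learn class while a ready transformer instance passes through unchanged:

```python
# Hypothetical sketch of string-or-transformer resolution; not Poniard's actual implementation.
# Class names are given as strings here to keep the sketch dependency-free.
SCALERS = {"standard": "StandardScaler", "minmax": "MinMaxScaler", "robust": "RobustScaler"}

def resolve_scaler(scaler):
    """Map a keyword to a scaler class name, or pass a transformer instance through."""
    if scaler is None:
        return SCALERS["standard"]  # StandardScaler is the default seen in the pipeline output above
    if isinstance(scaler, str):
        try:
            return SCALERS[scaler]
        except KeyError:
            raise ValueError(f"Unknown scaler keyword: {scaler!r}")
    return scaler  # assumed to be an sklearn-compatible transformer

print(resolve_scaler("robust"))  # RobustScaler
```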

The latter work by separating features into buckets. In particular, numeric (int, float) features can be left as numeric or cast to a high-cardinality categorical (if the number of unique values is below numeric_threshold), while categoricals can be either low or high cardinality (if the number of unique values exceeds cardinality_threshold).
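The threshold arithmetic can be illustrated with a simplified sketch (hypothetical helper functions, not Poniard's actual inference code): an int threshold is an absolute count, while a float threshold is multiplied by the number of samples, so with 500 samples the default numeric_threshold=0.1 resolves to 50 unique values.

```python
# Simplified sketch of the type inference thresholds; not Poniard's actual implementation.
def effective_threshold(threshold, n_samples):
    """Ints are absolute counts; floats are fractions of the sample size."""
    if isinstance(threshold, float):
        return int(threshold * n_samples)
    return threshold

def infer_bucket(n_unique, is_number, n_samples,
                 numeric_threshold=0.1, cardinality_threshold=20):
    """Assign a feature to a bucket based on uniqueness, mimicking the logic described above."""
    if is_number:
        if n_unique > effective_threshold(numeric_threshold, n_samples):
            return "numeric"
        return "categorical_high"  # number-like but with few unique values
    if n_unique > effective_threshold(cardinality_threshold, n_samples):
        return "categorical_high"
    return "categorical_low"

print(effective_threshold(0.1, 500))                                 # 50
print(infer_bucket(n_unique=50, is_number=True, n_samples=500))      # categorical_high
print(infer_bucket(n_unique=2, is_number=False, n_samples=500))      # categorical_low
```

This matches the example further down: "rating" (50 unique integers, not strictly above the threshold of 50) becomes a high-cardinality categorical, while "type" (2 unique strings) stays low cardinality.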


source

PoniardPreprocessor.build

 PoniardPreprocessor.build (X:Union[pandas.core.frame.DataFrame,numpy.ndarray,List,NoneType]=None,
                            y:Union[pandas.core.frame.DataFrame,numpy.ndarray,List,NoneType]=None)

Builds the preprocessor according to the input data.

Gets the data from the main PoniardBaseEstimator (if available) or processes the input data, calls the type inference method, sets up the transformers and builds the pipeline.

|  | Type | Default | Details |
|---|---|---|---|
| X | Optional[Union[pd.DataFrame, np.ndarray, List]] | None | Features. |
| y | Optional[Union[pd.DataFrame, np.ndarray, List]] | None | Target. |
| Returns | PoniardPreprocessor |  |  |
import random

import numpy as np
import pandas as pd

random.seed(0)
rng = np.random.default_rng(0)

data = pd.DataFrame(
    {
        "type": random.choices(["house", "apartment"], k=500),
        "age": rng.uniform(1, 200, 500).astype(int),
        "date": pd.date_range("2022-01-01", freq="M", periods=500),
        "rating": random.choices(range(50), k=500),
        "target": random.choices([0, 1], k=500),
    }
)
data.head()
type age date rating target
0 apartment 127 2022-01-31 1 1
1 apartment 54 2022-02-28 17 1
2 house 9 2022-03-31 0 1
3 house 4 2022-04-30 48 1
4 apartment 162 2022-05-31 40 0

If running a standalone PoniardPreprocessor, the task (either “classification” or “regression”) must be specified in the constructor.

from poniard.preprocessing import PoniardPreprocessor

X, y = data.drop("target", axis=1), data["target"]
prep = PoniardPreprocessor(task="classification").build(X, y)

The actual preprocessing pipeline is held within the preprocessor attribute.

prep.preprocessor
Pipeline(steps=[('type_preprocessor',
                 ColumnTransformer(transformers=[('numeric_preprocessor',
                                                  Pipeline(steps=[('numeric_imputer',
                                                                   SimpleImputer()),
                                                                  ('scaler',
                                                                   StandardScaler())]),
                                                  ['age']),
                                                 ('categorical_low_preprocessor',
                                                  Pipeline(steps=[('categorical_imputer',
                                                                   SimpleImputer(strategy='most_frequent')),
                                                                  ('one-hot_encoder',
                                                                   OneHotEncoder(drop='if_binary',
                                                                                 hand...
                                                  Pipeline(steps=[('categorical_imputer',
                                                                   SimpleImputer(strategy='most_frequent')),
                                                                  ('high_cardinality_encoder',
                                                                   TargetEncoder(handle_unknown='ignore',
                                                                                 task='classification'))]),
                                                  ['rating']),
                                                 ('datetime_preprocessor',
                                                  Pipeline(steps=[('datetime_encoder',
                                                                   DatetimeEncoder()),
                                                                  ('datetime_imputer',
                                                                   SimpleImputer(strategy='most_frequent'))]),
                                                  ['date'])])),
                ('remove_invariant', VarianceThreshold())])

PoniardPreprocessor is included by default in, and tightly coupled with, PoniardBaseEstimator. During PoniardBaseEstimator.setup, a preprocessor instance is initialized and the whole estimator instance is assigned to the preprocessor’s _poniard attribute, giving it access to the data. Likewise, running PoniardBaseEstimator.reassign_types or PoniardBaseEstimator.add_preprocessing_step will trigger changes in the PoniardPreprocessor.
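The back-reference described above can be sketched in a few lines. This is a simplified, hypothetical rendering (only the `_poniard` attribute name comes from the docs; the class and attribute names below are illustrative) of how build can prefer data held by the linked estimator and fall back to explicit arguments:

```python
# Simplified sketch of the estimator/preprocessor coupling; not Poniard's actual code.
class Preprocessor:
    def __init__(self):
        self._poniard = None  # back-reference set by the estimator during setup

    def build(self, X=None, y=None):
        # Prefer data held by the linked estimator, fall back to explicit arguments.
        if self._poniard is not None:
            X, y = self._poniard.X, self._poniard.y
        self.n_samples = len(X)
        return self

class Estimator:
    def setup(self, X, y):
        self.X, self.y = X, y
        self.preprocessor = Preprocessor()
        self.preprocessor._poniard = self  # give the preprocessor access to the data
        self.preprocessor.build()
        return self

est = Estimator().setup([[1], [2], [3]], [0, 1, 0])
print(est.preprocessor.n_samples)  # 3
```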

from poniard import PoniardClassifier
from poniard.preprocessing import PoniardPreprocessor
X, y = data.drop("target", axis=1), data["target"]
clf = PoniardClassifier().setup(X, y)

Setup info

Target

Type: binary

Shape: (500,)

Unique values: 2

Metrics

Main metric: roc_auc

Feature type inference

Minimum unique values to consider a number-like feature numeric: 50

Minimum unique values to consider a categorical feature high cardinality: 20

Inferred feature types:

numeric categorical_high categorical_low datetime
0 age rating type date
clf._poniard_preprocessor
PoniardPreprocessor()
clf._poniard_preprocessor.preprocessor
Pipeline(steps=[('type_preprocessor',
                 ColumnTransformer(transformers=[('numeric_preprocessor',
                                                  Pipeline(steps=[('numeric_imputer',
                                                                   SimpleImputer()),
                                                                  ('scaler',
                                                                   StandardScaler())]),
                                                  ['age']),
                                                 ('categorical_low_preprocessor',
                                                  Pipeline(steps=[('categorical_imputer',
                                                                   SimpleImputer(strategy='most_frequent')),
                                                                  ('one-hot_encoder',
                                                                   OneHotEncoder(drop='if_binary',
                                                                                 hand...
                                                                   SimpleImputer(strategy='most_frequent')),
                                                                  ('high_cardinality_encoder',
                                                                   TargetEncoder(handle_unknown='ignore',
                                                                                 task='classification'))]),
                                                  ['rating']),
                                                 ('datetime_preprocessor',
                                                  Pipeline(steps=[('datetime_encoder',
                                                                   DatetimeEncoder()),
                                                                  ('datetime_imputer',
                                                                   SimpleImputer(strategy='most_frequent'))]),
                                                  ['date'])])),
                ('remove_invariant', VarianceThreshold())],
         verbose=0)

However, a custom instance of PoniardPreprocessor can be passed to estimators.

custom = PoniardPreprocessor(scaler="robust", numeric_imputer="iterative")
clf = PoniardClassifier(custom_preprocessor=custom).setup(X, y, show_info=False)
clf.fit()
PoniardClassifier(custom_preprocessor=PoniardPreprocessor(scaler='robust', numeric_imputer='iterative'))