Poniard Preprocessor

Preprocessing data based on input types.

source

PoniardPreprocessor

 PoniardPreprocessor (task:Optional[str]=None,
                      scaler:Optional[Union[str,TransformerMixin]]=None,
                      high_cardinality_encoder:Optional[Union[str,TransformerMixin]]=None,
                      numeric_imputer:Optional[Union[str,TransformerMixin]]=None,
                      custom_preprocessor:Union[None,Pipeline,TransformerMixin]=None,
                      numeric_threshold:Union[int,float]=0.1,
                      cardinality_threshold:Union[int,float]=20,
                      verbose:int=0, random_state:Optional[int]=None,
                      n_jobs:Optional[int]=None,
                      cache_transformations:bool=False)

Base preprocessor that builds an easily modifiable pipeline based on feature data types.

|  | Type | Default | Details |
|---|---|---|---|
| task | Optional[str] | None | Either "classification" or "regression". |
| scaler | Optional[Union[str, TransformerMixin]] | None | Numeric scaler method. Either "standard", "minmax", "robust" or a scikit-learn Transformer. |
| high_cardinality_encoder | Optional[Union[str, TransformerMixin]] | None | Encoder for categorical features with high cardinality. Either "target" or "ordinal", or a scikit-learn Transformer. |
| numeric_imputer | Optional[Union[str, TransformerMixin]] | None | Imputation method. Either "simple", "iterative" or a scikit-learn Transformer. |
| custom_preprocessor | Union[None, Pipeline, TransformerMixin] | None |  |
| numeric_threshold | Union[int, float] | 0.1 | Number-like features with unique values above this threshold will be treated as numeric. If float, the threshold is numeric_threshold * samples. |
| cardinality_threshold | Union[int, float] | 20 | Non-numeric features with cardinality above this threshold will be encoded with the high_cardinality_encoder instead of one-hot encoded. If float, the threshold is cardinality_threshold * samples. |
| verbose | int | 0 | Verbosity level. Propagated to every scikit-learn function and estimator. |
| random_state | Optional[int] | None | RNG seed. Propagated to every scikit-learn function and estimator. The default None sets random_state to 0 so that cross_validate results are comparable. |
| n_jobs | Optional[int] | None | Controls parallel processing. -1 uses all cores. Propagated to every scikit-learn function. |
| cache_transformations | bool | False | Whether to cache transformations and set the memory parameter for Pipelines. This can speed up slow transformations, as they are not recalculated for each estimator. |

PoniardPreprocessor’s job is to build a preprocessing pipeline that fits the input data, both features and target. It does this by inferring the types of the features and selecting an appropriate family of transformers for each group. The user is free to select which particular transformer to use for each group, for example by changing the default numeric scaler from StandardScaler to RobustScaler.

Customization is done through 3 parameters related to transformers (scaler, high_cardinality_encoder and numeric_imputer), which take either a keyword string or a standard sklearn-compatible transformer, and 2 parameters related to type inference (numeric_threshold and cardinality_threshold).
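The string-or-transformer pattern shared by the three transformer parameters can be sketched roughly as follows. This is a hypothetical helper (the SCALERS mapping and resolve_scaler name are illustrative, not Poniard's actual code), shown only to clarify how a keyword would resolve to a scikit-learn class while a ready transformer instance passes through unchanged:

```python
# Hypothetical sketch of string-or-transformer resolution; not Poniard's actual implementation.
# Class names are given as strings here to keep the sketch dependency-free.
SCALERS = {"standard": "StandardScaler", "minmax": "MinMaxScaler", "robust": "RobustScaler"}

def resolve_scaler(scaler):
    """Map a keyword to a scaler class name, or pass a transformer instance through."""
    if scaler is None:
        return SCALERS["standard"]  # StandardScaler is the default seen in the pipeline output above
    if isinstance(scaler, str):
        try:
            return SCALERS[scaler]
        except KeyError:
            raise ValueError(f"Unknown scaler keyword: {scaler!r}")
    return scaler  # assumed to be an sklearn-compatible transformer

print(resolve_scaler("robust"))  # RobustScaler
```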

The latter work by separating features into buckets. In particular, numeric (int, float) features can be left as numeric or cast to a high-cardinality categorical (if the number of unique values is below numeric_threshold), while categoricals can be either low or high cardinality (if the number of unique values exceeds cardinality_threshold).
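The threshold arithmetic can be illustrated with a simplified sketch (hypothetical helper functions, not Poniard's actual inference code): an int threshold is an absolute count, while a float threshold is multiplied by the number of samples, so with 500 samples the default numeric_threshold=0.1 resolves to 50 unique values.

```python
# Simplified sketch of the type inference thresholds; not Poniard's actual implementation.
def effective_threshold(threshold, n_samples):
    """Ints are absolute counts; floats are fractions of the sample size."""
    if isinstance(threshold, float):
        return int(threshold * n_samples)
    return threshold

def infer_bucket(n_unique, is_number, n_samples,
                 numeric_threshold=0.1, cardinality_threshold=20):
    """Assign a feature to a bucket based on uniqueness, mimicking the logic described above."""
    if is_number:
        if n_unique > effective_threshold(numeric_threshold, n_samples):
            return "numeric"
        return "categorical_high"  # number-like but with few unique values
    if n_unique > effective_threshold(cardinality_threshold, n_samples):
        return "categorical_high"
    return "categorical_low"

print(effective_threshold(0.1, 500))                                 # 50
print(infer_bucket(n_unique=50, is_number=True, n_samples=500))      # categorical_high
print(infer_bucket(n_unique=2, is_number=False, n_samples=500))      # categorical_low
```

This matches the example further down: "rating" (50 unique integers, not strictly above the threshold of 50) becomes a high-cardinality categorical, while "type" (2 unique strings) stays low cardinality.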


source

PoniardPreprocessor.build

 PoniardPreprocessor.build (X:Union[pandas.core.frame.DataFrame,numpy.ndarray,List,NoneType]=None,
                            y:Union[pandas.core.frame.DataFrame,numpy.ndarray,List,NoneType]=None)

Builds the preprocessor according to the input data.

Gets the data from the main PoniardBaseEstimator (if available) or processes the input data, calls the type inference method, sets up the transformers and builds the pipeline.

|  | Type | Default | Details |
|---|---|---|---|
| X | Optional[Union[pd.DataFrame, np.ndarray, List]] | None | Features. |
| y | Optional[Union[pd.DataFrame, np.ndarray, List]] | None | Target. |
| Returns | PoniardPreprocessor |  |  |
import random

import numpy as np
import pandas as pd

random.seed(0)
rng = np.random.default_rng(0)

data = pd.DataFrame(
    {
        "type": random.choices(["house", "apartment"], k=500),
        "age": rng.uniform(1, 200, 500).astype(int),
        "date": pd.date_range("2022-01-01", freq="M", periods=500),
        "rating": random.choices(range(50), k=500),
        "target": random.choices([0, 1], k=500),
    }
)
data.head()
type age date rating target
0 apartment 127 2022-01-31 1 1
1 apartment 54 2022-02-28 17 1
2 house 9 2022-03-31 0 1
3 house 4 2022-04-30 48 1
4 apartment 162 2022-05-31 40 0

If running a standalone PoniardPreprocessor, the task (either “classification” or “regression”) must be specified in the constructor.

from poniard.preprocessing import PoniardPreprocessor

X, y = data.drop("target", axis=1), data["target"]
prep = PoniardPreprocessor(task="classification").build(X, y)

The actual preprocessing pipeline is held within the preprocessor attribute.

prep.preprocessor
Pipeline(steps=[('type_preprocessor',
                 ColumnTransformer(transformers=[('numeric_preprocessor',
                                                  Pipeline(steps=[('numeric_imputer',
                                                                   SimpleImputer()),
                                                                  ('scaler',
                                                                   StandardScaler())]),
                                                  ['age']),
                                                 ('categorical_low_preprocessor',
                                                  Pipeline(steps=[('categorical_imputer',
                                                                   SimpleImputer(strategy='most_frequent')),
                                                                  ('one-hot_encoder',
                                                                   OneHotEncoder(drop='if_binary',
                                                                                 hand...
                                                  Pipeline(steps=[('categorical_imputer',
                                                                   SimpleImputer(strategy='most_frequent')),
                                                                  ('high_cardinality_encoder',
                                                                   TargetEncoder(handle_unknown='ignore',
                                                                                 task='classification'))]),
                                                  ['rating']),
                                                 ('datetime_preprocessor',
                                                  Pipeline(steps=[('datetime_encoder',
                                                                   DatetimeEncoder()),
                                                                  ('datetime_imputer',
                                                                   SimpleImputer(strategy='most_frequent'))]),
                                                  ['date'])])),
                ('remove_invariant', VarianceThreshold())])

PoniardPreprocessor is included by default in, and tightly coupled with, PoniardBaseEstimator. During PoniardBaseEstimator.setup, a preprocessor instance is initialized and the whole estimator instance is assigned to the preprocessor’s _poniard attribute, giving it access to the data. Likewise, running PoniardBaseEstimator.reassign_types or PoniardBaseEstimator.add_preprocessing_step will trigger changes in the PoniardPreprocessor.
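The back-reference described above can be sketched in a few lines. This is a simplified, hypothetical rendering (only the `_poniard` attribute name comes from the docs; the class and attribute names below are illustrative) of how build can prefer data held by the linked estimator and fall back to explicit arguments:

```python
# Simplified sketch of the estimator/preprocessor coupling; not Poniard's actual code.
class Preprocessor:
    def __init__(self):
        self._poniard = None  # back-reference set by the estimator during setup

    def build(self, X=None, y=None):
        # Prefer data held by the linked estimator, fall back to explicit arguments.
        if self._poniard is not None:
            X, y = self._poniard.X, self._poniard.y
        self.n_samples = len(X)
        return self

class Estimator:
    def setup(self, X, y):
        self.X, self.y = X, y
        self.preprocessor = Preprocessor()
        self.preprocessor._poniard = self  # give the preprocessor access to the data
        self.preprocessor.build()
        return self

est = Estimator().setup([[1], [2], [3]], [0, 1, 0])
print(est.preprocessor.n_samples)  # 3
```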

from poniard import PoniardClassifier
from poniard.preprocessing import PoniardPreprocessor
X, y = data.drop("target", axis=1), data["target"]
clf = PoniardClassifier().setup(X, y)

Setup info

Target

Type: binary

Shape: (500,)

Unique values: 2

Metrics

Main metric: roc_auc

Feature type inference

Minimum unique values to consider a number-like feature numeric: 50

Minimum unique values to consider a categorical feature high cardinality: 20

Inferred feature types:

numeric categorical_high categorical_low datetime
0 age rating type date
clf._poniard_preprocessor
PoniardPreprocessor()
clf._poniard_preprocessor.preprocessor
Pipeline(steps=[('type_preprocessor',
                 ColumnTransformer(transformers=[('numeric_preprocessor',
                                                  Pipeline(steps=[('numeric_imputer',
                                                                   SimpleImputer()),
                                                                  ('scaler',
                                                                   StandardScaler())]),
                                                  ['age']),
                                                 ('categorical_low_preprocessor',
                                                  Pipeline(steps=[('categorical_imputer',
                                                                   SimpleImputer(strategy='most_frequent')),
                                                                  ('one-hot_encoder',
                                                                   OneHotEncoder(drop='if_binary',
                                                                                 hand...
                                                                   SimpleImputer(strategy='most_frequent')),
                                                                  ('high_cardinality_encoder',
                                                                   TargetEncoder(handle_unknown='ignore',
                                                                                 task='classification'))]),
                                                  ['rating']),
                                                 ('datetime_preprocessor',
                                                  Pipeline(steps=[('datetime_encoder',
                                                                   DatetimeEncoder()),
                                                                  ('datetime_imputer',
                                                                   SimpleImputer(strategy='most_frequent'))]),
                                                  ['date'])])),
                ('remove_invariant', VarianceThreshold())],
         verbose=0)

However, a custom instance of PoniardPreprocessor can be passed to estimators.

custom = PoniardPreprocessor(scaler="robust", numeric_imputer="iterative")
clf = PoniardClassifier(custom_preprocessor=custom).setup(X, y, show_info=False)
clf.fit()
PoniardClassifier(custom_preprocessor=PoniardPreprocessor(scaler='robust', numeric_imputer='iterative'))