from poniard import PoniardClassifier
Base estimator
PoniardBaseEstimator sets up 95% of the functionality for PoniardClassifier and PoniardRegressor.
PoniardBaseEstimator
PoniardBaseEstimator (estimators:Optional[Union[Sequence[ClassifierMixin],Dict[str,ClassifierMixin],Sequence[RegressorMixin],Dict[str,RegressorMixin]]]=None, metrics:Optional[Union[str,Dict[str,Callable],Sequence[str]]]=None, preprocess:bool=True, custom_preprocessor:Union[None,Pipeline,TransformerMixin,PoniardPreprocessor]=None, cv:Union[int,BaseCrossValidator,BaseShuffleSplit,Sequence]=None, verbose:int=0, random_state:Optional[int]=None, n_jobs:Optional[int]=None, plugins:Optional[Sequence[Any]]=None, plot_options:Optional[PoniardPlotFactory]=None)
Base estimator that sets up all the functionality for the classifier and regressor.
Parameter | Type | Default | Details |
---|---|---|---|
estimators | Optional[Union[Sequence[ClassifierMixin], Dict[str, ClassifierMixin], Sequence[RegressorMixin], Dict[str, RegressorMixin]]] | None | Estimators to evaluate. |
metrics | Optional[Union[str, Dict[str, Callable], Sequence[str]]] | None | Metrics to compute for each estimator. This is more restrictive than sklearn’s scoring parameter, as it does not allow callable scorers. Single strings are cast to lists automatically. |
preprocess | bool | True | If True, impute missing values, standard scale numeric data and one-hot or ordinal encode categorical data. |
custom_preprocessor | Union[None, Pipeline, TransformerMixin, PoniardPreprocessor] | None | Preprocessor used instead of the default preprocessing pipeline. It must be able to be included directly in a scikit-learn Pipeline. |
cv | Union[int, BaseCrossValidator, BaseShuffleSplit, Sequence] | None | Cross validation strategy. Either an integer, a scikit-learn cross validation object, or an iterable. |
verbose | int | 0 | Verbosity level. Propagated to every scikit-learn function and estimator. |
random_state | Optional[int] | None | RNG. Propagated to every scikit-learn function and estimator. The default None sets random_state to 0 so that cross_validate results are comparable. |
n_jobs | Optional[int] | None | Controls parallel processing. -1 uses all cores. Propagated to every scikit-learn function. |
plugins | Optional[Sequence[Any]] | None | Plugin instances that run in set moments of setup, fit and plotting. |
plot_options | Optional[PoniardPlotFactory] | None | poniard.plot.plot_factory.PoniardPlotFactory instance specifying Plotly format options, or None, which sets the default factory. |
See the guides on Getting started, Main parameters and Preprocessing for examples on how the constructor parameters work.
Main methods
PoniardBaseEstimator.setup
PoniardBaseEstimator.setup (X:Union[pandas.core.frame.DataFrame,numpy.ndarray,List], y:Union[pandas.core.frame.DataFrame,numpy.ndarray,List], show_info:bool=True)
Acts as an orchestrator for Poniard estimators by setting up everything needed for PoniardBaseEstimator.fit.
Converts inputs to arrays if necessary, and sets metrics, preprocessor, cv and pipelines.
After running PoniardBaseEstimator.setup, both X and y will be held as attributes.
Parameter | Type | Default | Details |
---|---|---|---|
X | Union[pd.DataFrame, np.ndarray, List] | | Features. |
y | Union[pd.DataFrame, np.ndarray, List] | | Target. |
show_info | bool | True | Whether to print information about the target, metrics and type inference. |
Returns | PoniardBaseEstimator |
PoniardBaseEstimator.setup takes features and target as parameters, while PoniardBaseEstimator.fit does not accept any. This runs contrary to the convention established by scikit-learn, where there is no setup step and fit takes the data as parameters.
This is because Poniard does not only fit the models, but also infers feature types and creates the preprocessor based on these types. While this could all be stuffed inside PoniardBaseEstimator.fit (that was the case initially), having it separated allows the user to check whether Poniard’s assumptions are correct and adjust them if needed before running fit, which can take a long time depending on how many models were passed to estimators, the cross validation strategy and the size of the dataset.
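The split between a cheap, inspectable setup step and an expensive, argument-free fit step can be sketched with a toy class. This is only an illustration of the pattern: `TwoPhaseEvaluator` and its `baseline_` attribute are hypothetical names and are not part of Poniard.

```python
from statistics import mean


class TwoPhaseEvaluator:
    """Hypothetical stand-in that mimics the setup()/fit() split; not Poniard code."""

    def setup(self, X, y):
        # Store the data and derive cheap metadata that can be inspected early.
        self.X, self.y = X, y
        self.target_info = {"shape": (len(y),), "nunique": len(set(y))}
        return self

    def fit(self):
        # The costly step takes no arguments: it reuses what setup() stored.
        if not hasattr(self, "X"):
            raise RuntimeError("Call setup(X, y) before fit().")
        self.baseline_ = mean(self.y)
        return self


evaluator = TwoPhaseEvaluator().setup([[1], [2], [3], [4]], [0, 1, 1, 1])
print(evaluator.target_info)  # review the inferred information first
evaluator.fit()               # then run the (potentially slow) step
```

Returning `self` from both methods allows chaining, which is the same style the later examples on this page use with `PoniardClassifier().setup(...)`.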
PoniardBaseEstimator by default includes a PoniardPreprocessor that handles building the preprocessor that will go into the final estimation pipelines. However, a PoniardPreprocessor with custom parameters can be passed as custom_preprocessor.
An example
Let’s load some random data and set up a PoniardClassifier, which inherits from PoniardBaseEstimator.
import random

import numpy as np
import pandas as pd

random.seed(0)
rng = np.random.default_rng(0)

data = pd.DataFrame(
    {
        "type": random.choices(["house", "apartment"], k=500),
        "age": rng.uniform(1, 200, 500).astype(int),
        "date": pd.date_range("2022-01-01", freq="M", periods=500),
        "rating": random.choices(range(50), k=500),
        "target": random.choices([0, 1], k=500),
    }
)
data.head()
index | type | age | date | rating | target |
---|---|---|---|---|---|
0 | apartment | 127 | 2022-01-31 | 1 | 1 |
1 | apartment | 54 | 2022-02-28 | 17 | 1 |
2 | house | 9 | 2022-03-31 | 0 | 1 |
3 | house | 4 | 2022-04-30 | 48 | 1 |
4 | apartment | 162 | 2022-05-31 | 40 | 0 |
Information about the data will be shown so it can be reviewed and changes can be made.
X, y = data.drop("target", axis=1), data["target"]
pnd = PoniardClassifier()
pnd.setup(X, y)
Setup info
Target
Type: binary
Shape: (500,)
Unique values: 2
Metrics
Main metric: roc_auc
Feature type inference
Minimum unique values to consider a number-like feature numeric: 50
Minimum unique values to consider a categorical feature high cardinality: 20
Inferred feature types:
index | numeric | categorical_high | categorical_low | datetime |
---|---|---|---|---|
0 | age | rating | type | date |
PoniardClassifier()
After passing data to Poniard estimators through setup, multiple attributes become available.
feature_types is a dict that sorts features into 4 categories (numeric, categorical_high, categorical_low and datetime) using some basic heuristics. This attribute is computed in PoniardPreprocessor.build, and will not be available if a non-PoniardPreprocessor transformer is passed to custom_preprocessor.
Feature types depend on the feature dtypes, and on the numeric_threshold and cardinality_threshold parameters used in PoniardPreprocessor’s construction.
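To make the role of the two thresholds concrete, here is a minimal, dependency-free sketch of this kind of heuristic. It is not Poniard's implementation: `infer_feature_types` is a hypothetical helper, and treating both thresholds as strict "greater than" comparisons is an assumption made for illustration.

```python
from datetime import date


def infer_feature_types(columns, numeric_threshold=50, cardinality_threshold=20):
    """Sort columns into four buckets using value types and unique counts."""
    types = {"numeric": [], "categorical_high": [], "categorical_low": [], "datetime": []}
    for name, values in columns.items():
        n_unique = len(set(values))
        if all(isinstance(v, date) for v in values):
            types["datetime"].append(name)
        elif all(isinstance(v, (int, float)) for v in values) and n_unique > numeric_threshold:
            # Number-like with many distinct values: treat as numeric (assumed rule).
            types["numeric"].append(name)
        elif n_unique > cardinality_threshold:
            types["categorical_high"].append(name)
        else:
            types["categorical_low"].append(name)
    return types


columns = {
    "age": list(range(1, 101)),                  # 100 unique numbers
    "rating": [i % 30 for i in range(100)],      # 30 unique numbers
    "type": ["house", "apartment"] * 50,         # 2 unique strings
    "date": [date(2022, 1, 1 + i % 28) for i in range(100)],
}
feature_types = infer_feature_types(columns)
print(feature_types)
```

With these made-up columns, a number-like feature with few distinct values (`rating`) ends up as a high cardinality categorical rather than numeric, which mirrors the inference shown in the setup info above.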
pnd.feature_types
{'numeric': ['age'],
'categorical_high': ['rating'],
'categorical_low': ['type'],
'datetime': ['date']}
The preprocessor can be either the transformer produced by a PoniardPreprocessor, which in turn depends on feature_types and on the scaler, numeric_imputer and high_cardinality_encoder parameters, or a user-supplied scikit-learn compatible transformer.
As will be seen further on, the PoniardPreprocessor can be modified significantly to fit multiple use cases and datasets.
pnd.preprocessor
Pipeline(steps=[('type_preprocessor', ColumnTransformer(transformers=[('numeric_preprocessor', Pipeline(steps=[('numeric_imputer', SimpleImputer()), ('scaler', StandardScaler())]), ['age']), ('categorical_low_preprocessor', Pipeline(steps=[('categorical_imputer', SimpleImputer(strategy='most_frequent')), ('one-hot_encoder', OneHotEncoder(drop='if_binary', hand... SimpleImputer(strategy='most_frequent')), ('high_cardinality_encoder', TargetEncoder(handle_unknown='ignore', task='classification'))]), ['rating']), ('datetime_preprocessor', Pipeline(steps=[('datetime_encoder', DatetimeEncoder()), ('datetime_imputer', SimpleImputer(strategy='most_frequent'))]), ['date'])])), ('remove_invariant', VarianceThreshold())], verbose=0)
Each estimator has a set of default metrics, but others can be passed during construction.
pnd.metrics
['roc_auc', 'accuracy', 'precision', 'recall', 'f1']
Likewise, cv has sane defaults but can be modified as needed.
pnd.cv
StratifiedKFold(n_splits=5, random_state=0, shuffle=True)
target_info lists information about y.
pnd.target_info
{'type_': 'binary', 'ndim': 1, 'shape': (500,), 'nunique': 2}
pipelines is a dict containing each pipeline that will be trained during fit. Each Poniard estimator has a limited set of default estimators that are used if none are specified during initialization.
pnd.pipelines["SVC"]
Pipeline(steps=[('preprocessor', Pipeline(steps=[('type_preprocessor', ColumnTransformer(transformers=[('numeric_preprocessor', Pipeline(steps=[('numeric_imputer', SimpleImputer()), ('scaler', StandardScaler())]), ['age']), ('categorical_low_preprocessor', Pipeline(steps=[('categorical_imputer', SimpleImputer(strategy='most_frequent')), ('one-hot_encoder', One... TargetEncoder(handle_unknown='ignore', task='classification'))]), ['rating']), ('datetime_preprocessor', Pipeline(steps=[('datetime_encoder', DatetimeEncoder()), ('datetime_imputer', SimpleImputer(strategy='most_frequent'))]), ['date'])])), ('remove_invariant', VarianceThreshold())], verbose=0)), ('SVC', SVC(kernel='linear', probability=True, random_state=0, verbose=0))])
PoniardBaseEstimator.fit
PoniardBaseEstimator.fit ()
This is the main Poniard method. It uses scikit-learn’s cross_validate function to score all metrics for every pipeline in pipelines, using cv for cross validation.
pnd.fit()
PoniardClassifier()
Because features and target are passed to the Poniard estimator during setup, fit does not take any parameters.
After fitting pipelines, cross validated results can be accessed by running get_results.
PoniardBaseEstimator.get_results
PoniardBaseEstimator.get_results (return_train_scores:bool=False, std:bool=False, wrt_dummy:bool=False)
Return dataframe containing scoring results. By default returns the mean score and fit and score times. Optionally returns standard deviations as well.
Parameter | Type | Default | Details |
---|---|---|---|
return_train_scores | bool | False | If False, only return test scores. |
std | bool | False | Whether to return standard deviation of the scores. Default False. |
wrt_dummy | bool | False | Whether to compute each score/time with respect to the dummy estimator results. Default False. |
Returns | Union[Tuple[pd.DataFrame, pd.DataFrame], pd.DataFrame] | Results |
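The relationship between per-fold scores and the returned means and standard deviations can be sketched as follows. This is an assumption-laden illustration, not Poniard's code: `summarize` is a hypothetical helper, the scores are made up, and the sample standard deviation is an arbitrary choice. It does mirror the documented return type, which is a single table by default and a tuple of two tables when standard deviations are requested.

```python
from statistics import mean, stdev

# Made-up per-fold test scores for two estimators (illustrative numbers only).
fold_scores = {
    "ModelA": {"test_accuracy": [0.80, 0.90, 0.70]},
    "ModelB": {"test_accuracy": [0.50, 0.50, 0.50]},
}


def summarize(fold_scores, std=False):
    """Return per-estimator means, plus standard deviations when std=True."""
    means = {
        est: {metric: mean(values) for metric, values in metrics.items()}
        for est, metrics in fold_scores.items()
    }
    if not std:
        return means
    stds = {
        est: {metric: stdev(values) for metric, values in metrics.items()}
        for est, metrics in fold_scores.items()
    }
    return means, stds


means, stds = summarize(fold_scores, std=True)
print(means)
print(stds)
```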
pnd.get_results()
estimator | test_roc_auc | test_accuracy | test_precision | test_recall | test_f1 | fit_time | score_time |
---|---|---|---|---|---|---|---|
DecisionTreeClassifier | 0.510256 | 0.510 | 0.531145 | 0.503846 | 0.516707 | 0.010714 | 0.007243 |
DummyClassifier | 0.500000 | 0.520 | 0.520000 | 1.000000 | 0.684211 | 0.009618 | 0.007332 |
KNeighborsClassifier | 0.496675 | 0.492 | 0.509150 | 0.534615 | 0.519465 | 0.009883 | 0.008536 |
SVC | 0.472356 | 0.476 | 0.499007 | 0.688462 | 0.575907 | 0.715862 | 0.008426 |
LogisticRegression | 0.468990 | 0.488 | 0.509234 | 0.573077 | 0.536862 | 0.019850 | 0.007661 |
XGBClassifier | 0.460417 | 0.486 | 0.502401 | 0.500000 | 0.499330 | 0.046362 | 0.009421 |
HistGradientBoostingClassifier | 0.456571 | 0.488 | 0.505975 | 0.484615 | 0.494283 | 0.405131 | 0.019346 |
RandomForestClassifier | 0.435056 | 0.462 | 0.479861 | 0.476923 | 0.477449 | 0.070931 | 0.014314 |
GaussianNB | 0.423317 | 0.468 | 0.492473 | 0.565385 | 0.525371 | 0.010134 | 0.007401 |
means, stds = pnd.get_results(std=True, return_train_scores=True)
stds
estimator | test_roc_auc | train_roc_auc | test_accuracy | train_accuracy | test_precision | train_precision | test_recall | train_recall | test_f1 | train_f1 | fit_time | score_time |
---|---|---|---|---|---|---|---|---|---|---|---|---|
DecisionTreeClassifier | 0.060706 | 0.000000e+00 | 0.060332 | 0.000000 | 0.059942 | 0.000000 | 0.058835 | 0.000000 | 0.057785 | 0.000000 | 0.000303 | 0.000047 |
DummyClassifier | 0.000000 | 0.000000e+00 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000404 | 0.000100 |
KNeighborsClassifier | 0.021105 | 8.429609e-03 | 0.019391 | 0.010840 | 0.019140 | 0.008157 | 0.081043 | 0.022053 | 0.049760 | 0.012869 | 0.000341 | 0.000070 |
SVC | 0.038609 | 3.600720e-02 | 0.042708 | 0.032496 | 0.031965 | 0.028405 | 0.085485 | 0.073140 | 0.036968 | 0.026864 | 0.110736 | 0.000229 |
LogisticRegression | 0.068079 | 2.545484e-02 | 0.041183 | 0.027946 | 0.037992 | 0.024759 | 0.065948 | 0.021371 | 0.036585 | 0.022583 | 0.004623 | 0.000269 |
XGBClassifier | 0.065278 | 0.000000e+00 | 0.035553 | 0.000000 | 0.033315 | 0.000000 | 0.091826 | 0.000000 | 0.061108 | 0.000000 | 0.001688 | 0.000196 |
HistGradientBoostingClassifier | 0.059681 | 7.749323e-04 | 0.041183 | 0.007483 | 0.039938 | 0.011912 | 0.070291 | 0.005607 | 0.054859 | 0.007046 | 0.049279 | 0.005965 |
RandomForestClassifier | 0.060809 | 7.021667e-17 | 0.039192 | 0.000000 | 0.038392 | 0.000000 | 0.077307 | 0.000000 | 0.056132 | 0.000000 | 0.000342 | 0.000267 |
GaussianNB | 0.045845 | 2.494438e-02 | 0.042143 | 0.018303 | 0.037330 | 0.015830 | 0.031246 | 0.038051 | 0.025456 | 0.018727 | 0.000729 | 0.000126 |
get_estimator is a convenience method that gets a pipeline from pipelines by name, and optionally trains it on X and y.
PoniardBaseEstimator.get_estimator
PoniardBaseEstimator.get_estimator (estimator_name:str, include_preprocessor:bool=True, retrain:bool=False)
Obtain an estimator in pipelines by name. This is useful for extracting default estimators or hyperparameter-optimized estimators (after using PoniardBaseEstimator.tune_estimator).
Parameter | Type | Default | Details |
---|---|---|---|
estimator_name | str | | Estimator name. |
include_preprocessor | bool | True | Whether to return a pipeline with a preprocessor or just the estimator. Default True. |
retrain | bool | False | Whether to retrain with full data. Default False. |
Returns | Union[Pipeline, ClassifierMixin, RegressorMixin] | Estimator. |
PoniardBaseEstimator.analyze_estimator
PoniardBaseEstimator.analyze_estimator (estimator_name:str, height:int=800, width:int=800)
Print a selection of metrics and plots for a given estimator.
By default, orders estimators according to the first metric.
Parameter | Type | Default | Details |
---|---|---|---|
estimator_name | str | | Name of estimator to analyze. |
height | int | 800 | Height of output Figure . |
width | int | 800 | Width of output Figure . |
Returns | Figure | Figure |
PoniardBaseEstimator.analyze_estimator provides a quick overview of an estimator’s performance.
from sklearn.datasets import load_breast_cancer
from poniard import PoniardClassifier
X, y = load_breast_cancer(return_X_y=True, as_frame=True)
pnd = PoniardClassifier().setup(X, y, show_info=False)
pnd.fit()
PoniardClassifier()
pnd.analyze_estimator("SVC", height=1000, width=1000)
Modifying estimators after initialization
Estimators can be added and removed directly. Note that if other estimators have already been fit, only the added ones will be processed during PoniardBaseEstimator.fit.
PoniardBaseEstimator.add_estimators
PoniardBaseEstimator.add_estimators (estimators:Union[Dict[str,sklearn.base.ClassifierMixin],Sequence[sklearn.base.ClassifierMixin]])
Include new estimators. This is the recommended way of adding an estimator (as opposed to modifying pipelines directly), since it also injects random state, n_jobs and verbosity.
Parameter | Type | Details |
---|---|---|
estimators | Union[Dict[str, ClassifierMixin], Sequence[ClassifierMixin]] | Estimators to add. |
Returns | PoniardBaseEstimator | Self. |
PoniardBaseEstimator.remove_estimators
PoniardBaseEstimator.remove_estimators (estimator_names:Sequence[str], drop_results:bool=True)
Remove estimators. This is the recommended way of removing an estimator (as opposed to modifying pipelines directly), since it also removes the associated rows from the results tables.
Parameter | Type | Default | Details |
---|---|---|---|
estimator_names | Sequence[str] | | Estimators to remove. |
drop_results | bool | True | Whether to remove the results associated with the estimators. Default True. |
Returns | PoniardBaseEstimator | Self. |
from sklearn.ensemble import ExtraTreesClassifier
pnd.add_estimators(ExtraTreesClassifier())
pnd.remove_estimators("RandomForestClassifier")
pnd.fit()
pnd.get_results()
estimator | test_roc_auc | test_accuracy | test_precision | test_recall | test_f1 | fit_time | score_time |
---|---|---|---|---|---|---|---|
LogisticRegression | 0.995456 | 0.978916 | 0.975411 | 0.991549 | 0.983351 | 0.007645 | 0.002424 |
SVC | 0.994139 | 0.975408 | 0.975111 | 0.985955 | 0.980477 | 0.008037 | 0.003919 |
HistGradientBoostingClassifier | 0.994128 | 0.970129 | 0.967263 | 0.985955 | 0.976433 | 0.539054 | 0.016192 |
XGBClassifier | 0.994123 | 0.970129 | 0.967554 | 0.985915 | 0.976469 | 0.049444 | 0.004278 |
ExtraTreesClassifier | 0.991055 | 0.968359 | 0.969925 | 0.980321 | 0.974955 | 0.042767 | 0.008918 |
GaussianNB | 0.988730 | 0.929700 | 0.940993 | 0.949413 | 0.944300 | 0.003169 | 0.004466 |
KNeighborsClassifier | 0.980610 | 0.964881 | 0.955018 | 0.991628 | 0.972746 | 0.002539 | 0.016843 |
DecisionTreeClassifier | 0.920983 | 0.926223 | 0.941672 | 0.941080 | 0.941054 | 0.005269 | 0.002359 |
DummyClassifier | 0.500000 | 0.627418 | 0.627418 | 1.000000 | 0.771052 | 0.001970 | 0.002919 |
pnd.pipelines.keys()
dict_keys(['LogisticRegression', 'GaussianNB', 'SVC', 'KNeighborsClassifier', 'DecisionTreeClassifier', 'HistGradientBoostingClassifier', 'XGBClassifier', 'DummyClassifier', 'ExtraTreesClassifier'])
Modifying the preprocessor after initialization
The preprocessor can be modified from within PoniardBaseEstimator in two ways after PoniardBaseEstimator.setup:
- reassign_types changes how features are categorized so that they are processed by other transformers, e.g., a numeric feature could be cast to a high cardinality categorical (for example, a store ID).
- add_preprocessing_step adds a transformer or pipeline to the existing preprocessor.
See the Preprocessing guide for examples.
PoniardBaseEstimator.reassign_types
PoniardBaseEstimator.reassign_types (numeric:Optional[List[Union[str,int]]]=None, categorical_high:Optional[List[Union[str,int]]]=None, categorical_low:Optional[List[Union[str,int]]]=None, datetime:Optional[List[Union[str,int]]]=None, keep_remainder:bool=True)
Reassign feature types. By default, leaves omitted features as they were.
Parameter | Type | Default | Details |
---|---|---|---|
numeric | Optional[List[Union[str, int]]] | None | List of column names or indices. Default None. |
categorical_high | Optional[List[Union[str, int]]] | None | List of column names or indices. Default None. |
categorical_low | Optional[List[Union[str, int]]] | None | List of column names or indices. Default None. |
datetime | Optional[List[Union[str, int]]] | None | List of column names or indices. Default None. |
keep_remainder | bool | True | Whether to keep features not specified in the method parameters as is or drop them |
Returns | PoniardBaseEstimator | self. |
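The effect of keep_remainder can be illustrated on a plain dict shaped like feature_types. The `reassign` function below is a hypothetical sketch of the described behavior (the `store_id` column is made up); it is not the method's actual implementation.

```python
def reassign(feature_types, keep_remainder=True, **assignments):
    """Move features into the requested buckets; optionally drop unmentioned ones."""
    mentioned = {f for features in assignments.values() for f in features}
    new_types = {}
    for bucket, features in feature_types.items():
        # Unmentioned features stay where they were only if keep_remainder is True.
        kept = [f for f in features if f not in mentioned] if keep_remainder else []
        new_types[bucket] = kept + list(assignments.get(bucket, []))
    return new_types


feature_types = {
    "numeric": ["age", "store_id"],
    "categorical_high": ["rating"],
    "categorical_low": ["type"],
    "datetime": ["date"],
}
kept = reassign(feature_types, categorical_high=["store_id"])
dropped = reassign(feature_types, keep_remainder=False, categorical_high=["store_id"])
print(kept)
print(dropped)
```

In the first call only `store_id` changes bucket; in the second, every feature that was not explicitly listed is dropped.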
PoniardBaseEstimator.add_preprocessing_step
PoniardBaseEstimator.add_preprocessing_step (step:Union[sklearn.pipeline.Pipeline,sklearn.base.TransformerMixin,sklearn.compose._column_transformer.ColumnTransformer,Tuple[str,Union[sklearn.pipeline.Pipeline,sklearn.base.TransformerMixin,sklearn.compose._column_transformer.ColumnTransformer]]], position:Union[str,int]='end')
Add a preprocessing step.
Parameter | Type | Default | Details |
---|---|---|---|
step | Union[Union[Pipeline, TransformerMixin, ColumnTransformer], Tuple[str, Union[Pipeline, TransformerMixin, ColumnTransformer]]] | | A tuple of (str, transformer) or a scikit-learn transformer. Note that the transformer can also be a Pipeline or ColumnTransformer. |
position | Union[str, int] | end | Either an integer denoting before which step in the existing preprocessing pipeline the new step should be added, or ‘start’ or ‘end’. |
Returns | Pipeline | self |
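The meaning of position can be sketched on a plain list of (name, transformer) steps. `insert_step` is a hypothetical helper written for illustration, with strings standing in for transformer objects; it only demonstrates the documented semantics of an integer, ‘start’ or ‘end’.

```python
def insert_step(steps, new_step, position="end"):
    """Return a new list of (name, transformer) steps with new_step added."""
    if position == "start":
        index = 0
    elif position == "end":
        index = len(steps)
    elif isinstance(position, int):
        index = position  # insert before the step currently at this index
    else:
        raise ValueError("position must be 'start', 'end' or an integer")
    return steps[:index] + [new_step] + steps[index:]


# Strings stand in for transformer objects to keep the sketch dependency-free.
steps = [
    ("type_preprocessor", "ColumnTransformer"),
    ("remove_invariant", "VarianceThreshold"),
]
with_pca = insert_step(steps, ("pca", "PCA"), position=1)
print([name for name, _ in with_pca])
```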
Prediction methods
Cross validated predictions (using scikit-learn’s cross_val_predict) can be obtained by calling the predict, predict_proba, decision_function or predict_all methods. Each of them takes an estimator_names parameter that specifies which models should be used.
PoniardBaseEstimator.predict
PoniardBaseEstimator.predict (estimator_names:Optional[Sequence[str]]=None)
Get cross validated target predictions where each sample belongs to a single test set.
Parameter | Type | Default | Details |
---|---|---|---|
estimator_names | Optional[Sequence[str]] | None | Estimators to include. If None, predict all estimators. |
Returns | Dict[str, np.ndarray] | Dict where keys are estimator names and values are numpy arrays of predictions. |
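The phrase "each sample belongs to a single test set" describes out-of-fold predictions: every sample is predicted by a model that did not see it during training. The following dependency-free sketch illustrates the idea with a trivial majority-class model and a simple interleaved split; both are stand-ins chosen for brevity and are not how Poniard or cross_val_predict are implemented.

```python
from collections import Counter


def out_of_fold_predictions(y, n_splits=3):
    """Predict each sample with a majority-class model trained on the other folds."""
    n = len(y)
    predictions = [None] * n
    for fold in range(n_splits):
        test_idx = [i for i in range(n) if i % n_splits == fold]
        train_y = [y[i] for i in range(n) if i % n_splits != fold]
        majority = Counter(train_y).most_common(1)[0][0]
        for i in test_idx:
            predictions[i] = majority  # each index is written exactly once
    return predictions


y = [1, 1, 1, 1, 0, 1, 1, 1, 0]
preds = out_of_fold_predictions(y)
print(preds)
```

The result has one prediction per sample, so it can be compared directly against `y`.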
PoniardBaseEstimator.predict_proba
PoniardBaseEstimator.predict_proba (estimator_names:Optional[Sequence[str]]=None)
Get cross validated target probability predictions where each sample belongs to a single test set.
Parameter | Type | Default | Details |
---|---|---|---|
estimator_names | Optional[Sequence[str]] | None | |
Returns | Dict[str, np.ndarray] | Dict where keys are estimator names and values are numpy arrays of prediction probabilities. |
PoniardBaseEstimator.decision_function
PoniardBaseEstimator.decision_function (estimator_names:Optional[Sequence[str]]=None)
Get cross validated decision function predictions where each sample belongs to a single test set.
Parameter | Type | Default | Details |
---|---|---|---|
estimator_names | Optional[Sequence[str]] | None | Estimators to include. If None, predict all estimators. |
Returns | Dict[str, np.ndarray] | Dict where keys are estimator names and values are numpy arrays of decision functions. |
PoniardBaseEstimator.predict_all
PoniardBaseEstimator.predict_all (estimator_names:Optional[Sequence[str]]=None)
Get cross validated target predictions, probabilities and decision functions where each sample belongs to a single test set.
Parameter | Type | Default | Details |
---|---|---|---|
estimator_names | Optional[Sequence[str]] | None | Estimators to include. If None, predict all estimators. |
Returns | Tuple[Dict[str, np.ndarray]] | Tuple of dicts where keys are estimator names and values are numpy arrays of predictions. |
Ensembles and hyperparameter tuning
PoniardBaseEstimator.build_ensemble
PoniardBaseEstimator.build_ensemble (method:str='stacking', estimator_names:Optional[Sequence[str]]=None, top_n:Optional[int]=3, sort_by:Optional[str]=None, ensemble_name:Optional[str]=None, **kwargs)
Combine estimators into an ensemble.
By default, orders estimators according to the first metric.
Parameter | Type | Default | Details |
---|---|---|---|
method | str | stacking | Ensemble method. Either “stacking” or “voting”. Default “stacking”. |
estimator_names | Optional[Sequence[str]] | None | Names of estimators to include. Default None, which uses top_n. |
top_n | Optional[int] | 3 | How many of the best estimators to include. |
sort_by | Optional[str] | None | Which metric to consider for ordering results. Default None, which uses the first metric. |
ensemble_name | Optional[str] | None | Ensemble name when adding to pipelines. Default None. |
kwargs | | | Passed to the ensemble class constructor. |
Returns | PoniardBaseEstimator | Self. |
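How top_n and sort_by could interact is sketched below on a plain dict of results. `pick_top_estimators` is a hypothetical helper, and the assumption that higher scores rank first is ours; the scores are copied from the breast cancer results table shown earlier on this page.

```python
def pick_top_estimators(results, sort_by=None, top_n=3):
    """Return the names of the top_n estimators ranked by one metric (higher first)."""
    first_row = next(iter(results.values()))
    metric = sort_by or next(iter(first_row))  # default: the first metric listed
    ranked = sorted(results, key=lambda name: results[name][metric], reverse=True)
    return ranked[:top_n]


results = {
    "XGBClassifier": {"test_roc_auc": 0.994123, "test_accuracy": 0.970129},
    "SVC": {"test_roc_auc": 0.994139, "test_accuracy": 0.975408},
    "LogisticRegression": {"test_roc_auc": 0.995456, "test_accuracy": 0.978916},
    "DecisionTreeClassifier": {"test_roc_auc": 0.920983, "test_accuracy": 0.926223},
}
top_two = pick_top_estimators(results, top_n=2)
print(top_two)
```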
pnd.build_ensemble(
    method="stacking",
    estimator_names=["DecisionTreeClassifier", "KNeighborsClassifier", "SVC"],
)
pnd.get_estimator("StackingClassifier")
Pipeline(steps=[('preprocessor', Pipeline(steps=[('type_preprocessor', Pipeline(steps=[('numeric_imputer', SimpleImputer()), ('scaler', StandardScaler())])), ('remove_invariant', VarianceThreshold())], verbose=0)), ('StackingClassifier', StackingClassifier(cv=StratifiedKFold(n_splits=5, random_state=0, shuffle=True), estimators=[('DecisionTreeClassifier', DecisionTreeClassifier(random_state=0)), ('KNeighborsClassifier', KNeighborsClassifier()), ('SVC', SVC(kernel='linear', probability=True, random_state=0, verbose=0))]))])
pnd.fit()
pnd.get_results()
estimator | test_roc_auc | test_accuracy | test_precision | test_recall | test_f1 | fit_time | score_time |
---|---|---|---|---|---|---|---|
LogisticRegression | 0.995456 | 0.978916 | 0.975411 | 0.991549 | 0.983351 | 0.007645 | 0.002424 |
SVC | 0.994139 | 0.975408 | 0.975111 | 0.985955 | 0.980477 | 0.008037 | 0.003919 |
HistGradientBoostingClassifier | 0.994128 | 0.970129 | 0.967263 | 0.985955 | 0.976433 | 0.539054 | 0.016192 |
XGBClassifier | 0.994123 | 0.970129 | 0.967554 | 0.985915 | 0.976469 | 0.049444 | 0.004278 |
StackingClassifier | 0.993999 | 0.973653 | 0.967485 | 0.991588 | 0.979308 | 0.053218 | 0.005176 |
ExtraTreesClassifier | 0.991055 | 0.968359 | 0.969925 | 0.980321 | 0.974955 | 0.042767 | 0.008918 |
GaussianNB | 0.988730 | 0.929700 | 0.940993 | 0.949413 | 0.944300 | 0.003169 | 0.004466 |
KNeighborsClassifier | 0.980610 | 0.964881 | 0.955018 | 0.991628 | 0.972746 | 0.002539 | 0.016843 |
DecisionTreeClassifier | 0.920983 | 0.926223 | 0.941672 | 0.941080 | 0.941054 | 0.005269 | 0.002359 |
DummyClassifier | 0.500000 | 0.627418 | 0.627418 | 1.000000 | 0.771052 | 0.001970 | 0.002919 |
Use get_predictions_similarity to compute how correlated the estimators’ predictions are. This can be useful for building ensembles with PoniardBaseEstimator.build_ensemble.
PoniardBaseEstimator.get_predictions_similarity
PoniardBaseEstimator.get_predictions_similarity (on_errors:bool=True)
Compute correlation/association between cross validated predictions for each estimator.
This can be useful for ensembling.
Parameter | Type | Default | Details |
---|---|---|---|
on_errors | bool | True | Whether to compute similarity on prediction errors instead of predictions. Default True. |
Returns | pd.DataFrame | Similarity. |
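Computing similarity on errors rather than on raw predictions can be sketched as follows. The exact correlation or association measure used by the library is not stated here, so the Pearson correlation of 0/1 error indicators below is an assumption, and `error_similarity` is a hypothetical helper with made-up data.

```python
from math import sqrt


def pearson(a, b):
    """Pearson correlation of two equal-length sequences (undefined if one is constant)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)


def error_similarity(y_true, preds_a, preds_b):
    """Correlate the 0/1 error indicators of two estimators."""
    errors_a = [int(p != t) for p, t in zip(preds_a, y_true)]
    errors_b = [int(p != t) for p, t in zip(preds_b, y_true)]
    return pearson(errors_a, errors_b)


y_true = [0, 1, 1, 0, 1, 0]
preds_a = [0, 1, 0, 0, 1, 1]  # wrong on samples 2 and 5
preds_b = [0, 1, 0, 0, 1, 0]  # wrong on sample 2 only
similarity = error_similarity(y_true, preds_a, preds_b)
print(round(similarity, 4))
```

Estimators whose errors are weakly correlated tend to be better candidates for combining, since they fail on different samples.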
pnd.get_predictions_similarity()
estimator | LogisticRegression | GaussianNB | SVC | KNeighborsClassifier | DecisionTreeClassifier | HistGradientBoostingClassifier | XGBClassifier | ExtraTreesClassifier | StackingClassifier |
---|---|---|---|---|---|---|---|---|---|
LogisticRegression | 1.000000 | 0.315978 | 0.726194 | 0.401876 | 0.211925 | 0.367325 | 0.294833 | 0.426033 | 0.547327 |
GaussianNB | 0.315978 | 1.000000 | 0.331160 | 0.524911 | 0.354022 | 0.454955 | 0.495528 | 0.518582 | 0.489758 |
SVC | 0.726194 | 0.331160 | 1.000000 | 0.368042 | 0.277664 | 0.403438 | 0.336311 | 0.390700 | 0.574735 |
KNeighborsClassifier | 0.401876 | 0.524911 | 0.368042 | 1.000000 | 0.363762 | 0.497702 | 0.497702 | 0.482094 | 0.712582 |
DecisionTreeClassifier | 0.211925 | 0.354022 | 0.277664 | 0.363762 | 1.000000 | 0.362908 | 0.521706 | 0.427338 | 0.392178 |
HistGradientBoostingClassifier | 0.367325 | 0.454955 | 0.403438 | 0.497702 | 0.362908 | 1.000000 | 0.726570 | 0.645759 | 0.582252 |
XGBClassifier | 0.294833 | 0.495528 | 0.336311 | 0.497702 | 0.521706 | 0.726570 | 1.000000 | 0.704906 | 0.517572 |
ExtraTreesClassifier | 0.426033 | 0.518582 | 0.390700 | 0.482094 | 0.427338 | 0.645759 | 0.704906 | 1.000000 | 0.564618 |
StackingClassifier | 0.547327 | 0.489758 | 0.574735 | 0.712582 | 0.392178 | 0.582252 | 0.517572 | 0.564618 | 1.000000 |
Poniard offers light hyperparameter tuning through tune_estimator, as well as hyperparameter grids for its default estimators. You are, however, free to specify any grid you want.
PoniardBaseEstimator.tune_estimator
PoniardBaseEstimator.tune_estimator (estimator_name:str, grid:Optional[Dict]=None, mode:str='grid', tuned_estimator_name:Optional[str]=None, **kwargs)
Hyperparameter tuning for a single estimator.
Parameter | Type | Default | Details |
---|---|---|---|
estimator_name | str | | Estimator to tune. |
grid | Optional[Dict] | None | Hyperparameter grid. Default None, which uses the grids available for default estimators. |
mode | str | grid | Type of search. Either “grid”, “halving” or “random”. Default “grid”. |
tuned_estimator_name | Optional[str] | None | Estimator name when adding to pipelines. Default None. |
kwargs | | | Passed to the search class constructor. |
Returns | Union[GridSearchCV, RandomizedSearchCV] | Self. |
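To give a sense of what a hyperparameter grid implies for each search mode, the sketch below expands a grid into its candidate combinations. The grid values are hypothetical and `expand_grid` is an illustrative helper, not part of Poniard or scikit-learn.

```python
import random
from itertools import product


def expand_grid(grid):
    """Yield every parameter combination in a {name: [values]} grid."""
    names = list(grid)
    for values in product(*(grid[name] for name in names)):
        yield dict(zip(names, values))


grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}  # hypothetical SVC grid
candidates = list(expand_grid(grid))
print(len(candidates))  # an exhaustive "grid" search evaluates all of these

# A "random" search would instead evaluate only a sample of the candidates.
sampled = random.Random(0).sample(candidates, k=3)
print(sampled)
```

The number of candidates grows multiplicatively with each parameter added to the grid, which is why sampling-based modes can be much cheaper for large grids.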