# Getting started

## Introduction
Essentially, a Poniard estimator is a set of scikit-learn estimators, a preprocessing strategy, a cross-validation strategy and one or more metrics with which to score models.

The idea behind Poniard is to abstract away some of the boilerplate involved in fitting multiple models and comparing their cross-validated results. However, a significant effort is made to keep everything flexible and as close to scikit-learn as possible.
Poniard includes a `PoniardClassifier` and a `PoniardRegressor`, aligned with scikit-learn classifiers and regressors.
## Basic usage
In the following example we will load a toy dataset (sklearn’s diabetes dataset, a simple regression task) and have at it with default parameters.
```python
from poniard import PoniardRegressor
from sklearn.datasets import load_diabetes

X, y = load_diabetes(as_frame=True, return_X_y=True)
pnd = PoniardRegressor()
pnd.setup(X, y)
```
Setup info

Target
- Type: continuous
- Shape: (442,)
- Unique values: 214

Metrics
- Main metric: neg_mean_squared_error

Feature type inference
- Minimum unique values to consider a number-like feature numeric: 44
- Minimum unique values to consider a categorical feature high cardinality: 20

Inferred feature types:
|   | numeric | categorical_high | categorical_low | datetime |
|---|---------|------------------|-----------------|----------|
| 0 | age     |                  | sex             |          |
| 1 | bmi     |                  |                 |          |
| 2 | bp      |                  |                 |          |
| 3 | s1      |                  |                 |          |
| 4 | s2      |                  |                 |          |
| 5 | s3      |                  |                 |          |
| 6 | s4      |                  |                 |          |
| 7 | s5      |                  |                 |          |
| 8 | s6      |                  |                 |          |
```
PoniardRegressor()
```
Out of the box, you will get some useful information regarding the target variable and the features, as well as information regarding current Poniard settings (main metric and type inference thresholds). These are covered in detail later.
Once Poniard has parsed the data and built the preprocessing pipeline, we are free to run `PoniardBaseEstimator.fit` and `PoniardBaseEstimator.get_results`.
```python
pnd.fit()
pnd.get_results()
```
| | test_neg_mean_squared_error | test_neg_mean_absolute_percentage_error | test_neg_median_absolute_error | test_r2 | fit_time | score_time |
---|---|---|---|---|---|---|
LinearRegression | -2977.598515 | -0.396566 | -39.009146 | 0.489155 | 0.005163 | 0.002297 |
ElasticNet | -3159.017211 | -0.422912 | -42.619546 | 0.460740 | 0.003851 | 0.002356 |
RandomForestRegressor | -3431.823331 | -0.419956 | -42.203000 | 0.414595 | 0.101286 | 0.004822 |
HistGradientBoostingRegressor | -3544.069433 | -0.407417 | -40.396390 | 0.391633 | 0.279463 | 0.007446 |
KNeighborsRegressor | -3615.195398 | -0.418674 | -38.980000 | 0.379625 | 0.003590 | 0.002260 |
XGBRegressor | -3923.488860 | -0.426471 | -39.031309 | 0.329961 | 0.056158 | 0.002874 |
LinearSVR | -4268.314411 | -0.374296 | -43.388592 | 0.271443 | 0.004315 | 0.002116 |
DummyRegressor | -5934.577616 | -0.621540 | -61.775921 | -0.000797 | 0.003002 | 0.001690 |
DecisionTreeRegressor | -6728.423034 | -0.591906 | -59.700000 | -0.145460 | 0.004800 | 0.001748 |
In those two lines, 9 different regression models were trained with cross-validation and their average scores for multiple metrics were printed.

Poniard tries to provide good defaults everywhere.

- `estimators`: scikit-learn provides more than 40 classifiers and 50 regressors, but for most problems you can make do with a limited list of the most battle-tested models. Poniard's default selection reflects that.
- `metrics`: different metrics capture different aspects of the relationship between predictions and ground truth, so Poniard includes multiple suitable ones.
- `cv`: cross-validation is a key aspect of the Poniard flow, and 5-fold cross-validation is used by default.
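All three can be overridden at construction time. The sketch below is a hypothetical example: the `estimators`, `metrics` and `cv` parameter names come from this guide, but their exact accepted types should be checked against the API reference.

```python
from sklearn.linear_model import Ridge
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import KFold

from poniard import PoniardRegressor

# Hypothetical override of Poniard's defaults: a custom estimator roster,
# a custom metric list and a 10-fold CV splitter (signatures assumed).
pnd = PoniardRegressor(
    estimators=[Ridge(), GradientBoostingRegressor()],
    metrics=["neg_mean_squared_error", "r2"],
    cv=KFold(n_splits=10, shuffle=True, random_state=0),
)
```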
Default preprocessing deserves its own mention. By default, type inference is run on the dataset's features and transformations are applied accordingly, which is handled by a `PoniardPreprocessor` built inside `PoniardBaseEstimator`. The end goal of the default preprocessor is to make models run without raising any errors.
As with most things in Poniard, the preprocessing pipeline can be modified (by passing a custom `PoniardPreprocessor`) or replaced entirely with the scikit-learn transformers you are used to.
```python
pnd.preprocessor
```
```
Pipeline(steps=[('type_preprocessor',
                 ColumnTransformer(transformers=[('numeric_preprocessor',
                                                  Pipeline(steps=[('numeric_imputer', SimpleImputer()),
                                                                  ('scaler', StandardScaler())]),
                                                  ['age', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']),
                                                 ('categorical_low_preprocessor',
                                                  Pipeline(steps=[('categorical_imputer', SimpleImputer(strategy='most_frequent')),
                                                                  ('one-hot_encoder', OneHotEncoder(drop='if_binary', handle_unknown='ignore', sparse=False))]),
                                                  ['sex'])])),
                ('remove_invariant', VarianceThreshold())],
         verbose=0)
```
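For reference, the default pipeline above could be written by hand with plain scikit-learn. The following is a minimal sketch that mirrors it for the diabetes columns; step names and structure are illustrative, not Poniard internals:

```python
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.feature_selection import VarianceThreshold

numeric = ["age", "bmi", "bp", "s1", "s2", "s3", "s4", "s5", "s6"]
categorical_low = ["sex"]

# Mean-impute and scale numeric features; mode-impute and one-hot encode
# low-cardinality categoricals; finally drop invariant columns.
custom_preprocessor = Pipeline(
    steps=[
        (
            "type_preprocessor",
            ColumnTransformer(
                transformers=[
                    (
                        "numeric",
                        Pipeline([("imputer", SimpleImputer()), ("scaler", StandardScaler())]),
                        numeric,
                    ),
                    (
                        "categorical_low",
                        Pipeline(
                            [
                                ("imputer", SimpleImputer(strategy="most_frequent")),
                                ("encoder", OneHotEncoder(drop="if_binary", handle_unknown="ignore")),
                            ]
                        ),
                        categorical_low,
                    ),
                ]
            ),
        ),
        ("remove_invariant", VarianceThreshold()),
    ]
)
```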
Poniard keeps track of which models it has cross-validated, which means that:

- If a new estimator is added, the existing ones will not be fitted again.
- If the preprocessor is changed after training, everything will be fitted again.
```python
from sklearn.linear_model import SGDRegressor

pnd.add_estimators(SGDRegressor(max_iter=10000))
pnd.fit()
pnd.get_results()
```
| | test_neg_mean_squared_error | test_neg_mean_absolute_percentage_error | test_neg_median_absolute_error | test_r2 | fit_time | score_time |
---|---|---|---|---|---|---|
LinearRegression | -2977.598515 | -0.396566 | -39.009146 | 0.489155 | 0.005163 | 0.002297 |
SGDRegressor | -2984.789764 | -0.396191 | -40.013179 | 0.487430 | 0.004640 | 0.001782 |
ElasticNet | -3159.017211 | -0.422912 | -42.619546 | 0.460740 | 0.003851 | 0.002356 |
RandomForestRegressor | -3431.823331 | -0.419956 | -42.203000 | 0.414595 | 0.101286 | 0.004822 |
HistGradientBoostingRegressor | -3544.069433 | -0.407417 | -40.396390 | 0.391633 | 0.279463 | 0.007446 |
KNeighborsRegressor | -3615.195398 | -0.418674 | -38.980000 | 0.379625 | 0.003590 | 0.002260 |
XGBRegressor | -3923.488860 | -0.426471 | -39.031309 | 0.329961 | 0.056158 | 0.002874 |
LinearSVR | -4268.314411 | -0.374296 | -43.388592 | 0.271443 | 0.004315 | 0.002116 |
DummyRegressor | -5934.577616 | -0.621540 | -61.775921 | -0.000797 | 0.003002 | 0.001690 |
DecisionTreeRegressor | -6728.423034 | -0.591906 | -59.700000 | -0.145460 | 0.004800 | 0.001748 |
A quick view of an estimator is available through `PoniardBaseEstimator.analyze_estimator`.
"SGDRegressor", height=1000, width=800) pnd.analyze_estimator(
## Plots: when `get_results` is not enough
While a nicely formatted table is useful, a graphical aid can make things go a lot smoother. Poniard estimators include a `plot` accessor that exposes multiple prebuilt plots.
Check the plotting reference for a deeper dive.
```python
fig = pnd.plot.metrics(
    kind="bar", metrics=["neg_mean_absolute_percentage_error", "neg_mean_squared_error"]
)
fig.show("notebook")
```
```python
fig = pnd.plot.residuals_histogram(estimator_names=["LinearRegression", "SGDRegressor"])
fig.show("notebook")
```
By using Plotly, users can have an easier time exploring charts, zooming in, selecting specific models, etc.
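Since these are regular Plotly figures, the standard Plotly export methods also work on them. A small sketch using Plotly's own API (not a Poniard feature); static image export additionally requires the kaleido package:

```python
# Write an interactive standalone HTML file, or a static image.
fig.write_html("residuals.html")
fig.write_image("residuals.png")
```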
A reasonably unified API
So far we have analyzed a regression task. Luckily, `PoniardClassifier` and `PoniardRegressor` differ only in default models, default cross-validation strategy and default metrics.
```python
from poniard import PoniardClassifier
from sklearn.datasets import load_wine

X, y = load_wine(return_X_y=True, as_frame=True)
clf = PoniardClassifier().setup(X, y, show_info=False)
clf.fit()
clf.get_results()
```
| | test_roc_auc_ovr | test_accuracy | test_precision_macro | test_recall_macro | test_f1_macro | fit_time | score_time |
---|---|---|---|---|---|---|---|
LogisticRegression | 1.000000 | 0.983175 | 0.982828 | 0.983810 | 0.982571 | 0.004461 | 0.003472 |
RandomForestClassifier | 0.999336 | 0.971905 | 0.973216 | 0.974098 | 0.972726 | 0.044476 | 0.007773 |
HistGradientBoostingClassifier | 0.999311 | 0.971746 | 0.970350 | 0.976508 | 0.972109 | 0.268834 | 0.025884 |
SVC | 0.999128 | 0.960635 | 0.960133 | 0.965079 | 0.960681 | 0.002355 | 0.002526 |
GaussianNB | 0.998855 | 0.971905 | 0.973533 | 0.974098 | 0.972720 | 0.002148 | 0.002939 |
XGBClassifier | 0.998213 | 0.949206 | 0.956410 | 0.950548 | 0.950512 | 0.020969 | 0.003620 |
KNeighborsClassifier | 0.995903 | 0.960794 | 0.959845 | 0.965079 | 0.960468 | 0.001462 | 0.003054 |
DecisionTreeClassifier | 0.945058 | 0.927302 | 0.933483 | 0.927961 | 0.928931 | 0.001737 | 0.002261 |
DummyClassifier | 0.500000 | 0.399048 | 0.133016 | 0.333333 | 0.190095 | 0.001500 | 0.002517 |
```python
fig = clf.plot.confusion_matrix(estimator_name="LogisticRegression")
fig.show("notebook")
```
## Intended use
In the real world where real data lives, building machine learning models is not as simple as running an abstraction in 2 lines and calling it a day.
Our preferred way of working is splitting the data into train and test sets before touching Poniard. That way you can pass the training data to `PoniardBaseEstimator.setup` and let the inbuilt cross-validation handle model evaluation.
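A minimal sketch of that workflow, using scikit-learn's `train_test_split` (variable names are illustrative):

```python
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_wine

from poniard import PoniardClassifier

X, y = load_wine(return_X_y=True, as_frame=True)

# Hold out a test set before Poniard ever sees the data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Cross validation inside Poniard only ever touches the training split.
clf = PoniardClassifier().setup(X_train, y_train, show_info=False)
clf.fit()
```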
When you are done, `PoniardBaseEstimator.get_estimator` simply returns a pipeline by name and optionally retrains it on the full data passed to Poniard (which should be just the training data), making it easy to continue working on models while preserving an unseen test set that can then be used to assess generalization.
"RandomForestClassifier", retrain=False) clf.get_estimator(
```
Pipeline(steps=[('preprocessor',
                 Pipeline(steps=[('type_preprocessor',
                                  Pipeline(steps=[('numeric_imputer', SimpleImputer()),
                                                  ('scaler', StandardScaler())])),
                                 ('remove_invariant', VarianceThreshold())],
                          verbose=0)),
                ('RandomForestClassifier', RandomForestClassifier(random_state=0))])
```
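Since the returned object is a plain scikit-learn pipeline, the familiar estimator API applies. A sketch of the final step, assuming the `X_train`/`X_test` split from the earlier example:

```python
# Retrieve the pipeline by name; the returned object is a regular
# scikit-learn Pipeline.
model = clf.get_estimator("RandomForestClassifier", retrain=False)

# Fit on the training data and assess generalization on the held-out set.
model.fit(X_train, y_train)
print(model.score(X_test, y_test))
```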