Getting started

Understand Poniard’s main functionality and how to get up and running quickly.

Introduction

Essentially, a Poniard estimator is a set of scikit-learn estimators, a preprocessing strategy, a cross validation strategy and one or more metrics with which to score models.

The idea behind Poniard is to abstract away some of the boilerplate involved in fitting multiple models and comparing their cross validated results. However, a significant effort is made to keep everything flexible and as close to scikit-learn as possible.

Poniard includes a PoniardClassifier and a PoniardRegressor, mirroring scikit-learn's split between classifiers and regressors.

Basic usage

In the following example we will load a toy dataset (scikit-learn's diabetes dataset, a simple regression task) and run Poniard on it with default parameters.

from poniard import PoniardRegressor
from sklearn.datasets import load_diabetes
X, y = load_diabetes(as_frame=True, return_X_y=True)
pnd = PoniardRegressor()
pnd.setup(X, y)

Setup info

Target

Type: continuous

Shape: (442,)

Unique values: 214

Metrics

Main metric: neg_mean_squared_error

Feature type inference

Minimum unique values to consider a number-like feature numeric: 44

Minimum unique values to consider a categorical feature high cardinality: 20

Inferred feature types:

  numeric categorical_high categorical_low datetime
0     age                              sex
1     bmi
2      bp
3      s1
4      s2
5      s3
6      s4
7      s5
8      s6
PoniardRegressor()

Out of the box, you will get some useful information regarding the target variable and the features, as well as information regarding current Poniard settings (the main metric and the type inference thresholds). These are covered in detail later.

Once Poniard has parsed the data and built the preprocessing pipeline, we are free to run PoniardBaseEstimator.fit and PoniardBaseEstimator.get_results.

pnd.fit()
pnd.get_results()
test_neg_mean_squared_error test_neg_mean_absolute_percentage_error test_neg_median_absolute_error test_r2 fit_time score_time
LinearRegression -2977.598515 -0.396566 -39.009146 0.489155 0.005163 0.002297
ElasticNet -3159.017211 -0.422912 -42.619546 0.460740 0.003851 0.002356
RandomForestRegressor -3431.823331 -0.419956 -42.203000 0.414595 0.101286 0.004822
HistGradientBoostingRegressor -3544.069433 -0.407417 -40.396390 0.391633 0.279463 0.007446
KNeighborsRegressor -3615.195398 -0.418674 -38.980000 0.379625 0.003590 0.002260
XGBRegressor -3923.488860 -0.426471 -39.031309 0.329961 0.056158 0.002874
LinearSVR -4268.314411 -0.374296 -43.388592 0.271443 0.004315 0.002116
DummyRegressor -5934.577616 -0.621540 -61.775921 -0.000797 0.003002 0.001690
DecisionTreeRegressor -6728.423034 -0.591906 -59.700000 -0.145460 0.004800 0.001748

With those two lines, 9 different regression models were trained with cross validation and the average scores for multiple metrics were printed.

Dummy estimators

Poniard always includes a DummyClassifier with strategy="prior" or a DummyRegressor with strategy="mean" in order to establish absolute minimum baseline scores. Models should easily beat these, but you could be surprised.
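
To make the baseline concrete, this is roughly what the dummy estimator amounts to in plain scikit-learn (a sketch for illustration, not Poniard internals):

from sklearn.datasets import load_diabetes
from sklearn.dummy import DummyRegressor
from sklearn.model_selection import cross_val_score

X, y = load_diabetes(return_X_y=True)
# Always predicts the training-fold mean; any real model should beat this score.
baseline = DummyRegressor(strategy="mean")
print(cross_val_score(baseline, X, y, scoring="neg_mean_squared_error").mean())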

Poniard tries to provide good defaults everywhere; all of them can be overridden at initialization, as sketched after this list.

  • estimators: scikit-learn provides more than 40 classifiers and 50 regressors, but for most problems you can make do with a short list of the most battle-tested models. Poniard reflects that.
  • metrics: different metrics capture different aspects of the relationship between predictions and ground truth, so Poniard includes multiple suitable ones.
  • cv: cross validation is a key aspect of the Poniard flow, and 5-fold cross validation is used by default.
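
A sketch of overriding all three follows; the keyword names mirror the list above, but the exact signature is an assumption, so check the API reference:

from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold
from poniard import PoniardRegressor

# Assumed keywords based on the defaults described above.
pnd = PoniardRegressor(
    estimators=[Ridge()],
    metrics=["neg_mean_squared_error", "r2"],
    cv=KFold(n_splits=10, shuffle=True, random_state=0),
)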
random_seed behavior

Poniard estimators’ random_seed parameter is always set (if random_seed=None at initialization, it is forced to 0) and injected into models and cross validators. The goal is a reproducible environment, including using the same cross validation folds for every model.
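
For example, fixing the seed at initialization propagates it everywhere:

from poniard import PoniardRegressor

# The same seed is injected into every model and the CV splitter,
# so all estimators are evaluated on identical folds.
pnd = PoniardRegressor(random_seed=42)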

Default preprocessing deserves its own mention. By default, type inference is run on the dataset’s features and transformations are applied accordingly, handled by a PoniardPreprocessor built inside PoniardBaseEstimator. The end goal of the default preprocessor is to make models run without raising any errors.

As with most things in Poniard, the preprocessing pipeline can be modified (by passing a custom PoniardPreprocessor) or replaced entirely with the scikit-learn transformers you are used to.
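
As an illustration, replacing the inferred pipeline with a single scikit-learn transformer might look like this; the custom_preprocessor keyword is an assumption for illustration, so check the API reference for the exact argument:

from sklearn.preprocessing import RobustScaler
from poniard import PoniardRegressor

# Hypothetical keyword: swaps the type-inferred pipeline for a single scaler.
pnd = PoniardRegressor(custom_preprocessor=RobustScaler())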

pnd.preprocessor
Pipeline(steps=[('type_preprocessor',
                 ColumnTransformer(transformers=[('numeric_preprocessor',
                                                  Pipeline(steps=[('numeric_imputer',
                                                                   SimpleImputer()),
                                                                  ('scaler',
                                                                   StandardScaler())]),
                                                  ['age', 'bmi', 'bp', 's1',
                                                   's2', 's3', 's4', 's5',
                                                   's6']),
                                                 ('categorical_low_preprocessor',
                                                  Pipeline(steps=[('categorical_imputer',
                                                                   SimpleImputer(strategy='most_frequent')),
                                                                  ('one-hot_encoder',
                                                                   OneHotEncoder(drop='if_binary',
                                                                                 handle_unknown='ignore',
                                                                                 sparse=False))]),
                                                  ['sex'])])),
                ('remove_invariant', VarianceThreshold())],
         verbose=0)

Poniard keeps track of which models it has cross validated, which means that

  1. If a new estimator is added, fitting again will not refit the existing ones.
  2. If the preprocessor is changed after training, everything will be fitted again.
from sklearn.linear_model import SGDRegressor
pnd.add_estimators(SGDRegressor(max_iter=10000))
pnd.fit()
pnd.get_results()
test_neg_mean_squared_error test_neg_mean_absolute_percentage_error test_neg_median_absolute_error test_r2 fit_time score_time
LinearRegression -2977.598515 -0.396566 -39.009146 0.489155 0.005163 0.002297
SGDRegressor -2984.789764 -0.396191 -40.013179 0.487430 0.004640 0.001782
ElasticNet -3159.017211 -0.422912 -42.619546 0.460740 0.003851 0.002356
RandomForestRegressor -3431.823331 -0.419956 -42.203000 0.414595 0.101286 0.004822
HistGradientBoostingRegressor -3544.069433 -0.407417 -40.396390 0.391633 0.279463 0.007446
KNeighborsRegressor -3615.195398 -0.418674 -38.980000 0.379625 0.003590 0.002260
XGBRegressor -3923.488860 -0.426471 -39.031309 0.329961 0.056158 0.002874
LinearSVR -4268.314411 -0.374296 -43.388592 0.271443 0.004315 0.002116
DummyRegressor -5934.577616 -0.621540 -61.775921 -0.000797 0.003002 0.001690
DecisionTreeRegressor -6728.423034 -0.591906 -59.700000 -0.145460 0.004800 0.001748
Single or array-like inputs

Anywhere Poniard takes estimators or metrics, or strings representing them, a single element or a sequence of elements can be safely passed and will be handled gracefully.
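
For example, both of the following calls work:

from sklearn.linear_model import BayesianRidge, HuberRegressor

pnd.add_estimators(BayesianRidge())                      # single estimator
pnd.add_estimators([BayesianRidge(), HuberRegressor()])  # sequence of estimators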

A quick view of an estimator is available through PoniardBaseEstimator.analyze_estimator.

pnd.analyze_estimator("SGDRegressor", height=1000, width=800)

Plots: when get_results is not enough

While a nicely formatted table is useful, plots can make comparisons a lot smoother. Poniard estimators include a plot accessor that provides multiple prebuilt plots.

Check the plotting reference for a deeper dive.

fig = pnd.plot.metrics(
    kind="bar", metrics=["neg_mean_absolute_percentage_error", "neg_mean_squared_error"]
)
fig.show("notebook")
fig = pnd.plot.residuals_histogram(estimator_names=["LinearRegression", "SGDRegressor"])
fig.show("notebook")

Because charts are built with Plotly, users can explore them interactively: zooming in, selecting specific models, etc.

A reasonably unified API

So far we have analyzed a regression task. Luckily, PoniardClassifier and PoniardRegressor differ only in default models, default cross validation strategy and default metrics.

from poniard import PoniardClassifier
from sklearn.datasets import load_wine
X, y = load_wine(return_X_y=True, as_frame=True)
clf = PoniardClassifier().setup(X, y, show_info=False)
clf.fit()
clf.get_results()
test_roc_auc_ovr test_accuracy test_precision_macro test_recall_macro test_f1_macro fit_time score_time
LogisticRegression 1.000000 0.983175 0.982828 0.983810 0.982571 0.004461 0.003472
RandomForestClassifier 0.999336 0.971905 0.973216 0.974098 0.972726 0.044476 0.007773
HistGradientBoostingClassifier 0.999311 0.971746 0.970350 0.976508 0.972109 0.268834 0.025884
SVC 0.999128 0.960635 0.960133 0.965079 0.960681 0.002355 0.002526
GaussianNB 0.998855 0.971905 0.973533 0.974098 0.972720 0.002148 0.002939
XGBClassifier 0.998213 0.949206 0.956410 0.950548 0.950512 0.020969 0.003620
KNeighborsClassifier 0.995903 0.960794 0.959845 0.965079 0.960468 0.001462 0.003054
DecisionTreeClassifier 0.945058 0.927302 0.933483 0.927961 0.928931 0.001737 0.002261
DummyClassifier 0.500000 0.399048 0.133016 0.333333 0.190095 0.001500 0.002517
fig = clf.plot.confusion_matrix(estimator_name="LogisticRegression")
fig.show("notebook")

Intended use

In the real world where real data lives, building machine learning models is not as simple as running an abstraction in 2 lines and calling it a day.

Our preferred way of working is splitting the data into train and test sets before touching Poniard. That way you can pass only the training data to PoniardBaseEstimator.setup and let the inbuilt cross validation handle model evaluation.
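
A minimal version of that workflow, reusing the diabetes data from earlier, could look like this:

from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from poniard import PoniardRegressor

X, y = load_diabetes(as_frame=True, return_X_y=True)
# Hold out a test set before Poniard sees any data.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# Cross validation now runs on the training split only.
pnd = PoniardRegressor().setup(X_train, y_train, show_info=False)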

When you are done, PoniardBaseEstimator.get_estimator returns a pipeline by name and optionally retrains it on the full data passed to setup (which, in this workflow, is just the training data), making it easy to continue working on models while preserving an unseen test set that can be used to assess generalization.

clf.get_estimator("RandomForestClassifier", retrain=False)
Pipeline(steps=[('preprocessor',
                 Pipeline(steps=[('type_preprocessor',
                                  Pipeline(steps=[('numeric_imputer',
                                                   SimpleImputer()),
                                                  ('scaler',
                                                   StandardScaler())])),
                                 ('remove_invariant', VarianceThreshold())],
                          verbose=0)),
                ('RandomForestClassifier',
                 RandomForestClassifier(random_state=0))])
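
Tying this back to the split sketched earlier, the chosen pipeline can then be retrained on the training data and scored on the held-out set; retrain=True here is an assumption based on the optional retraining described above:

pnd.fit()  # compare models as before
final = pnd.get_estimator("LinearRegression", retrain=True)
# R² on data no model saw during selection.
print(final.score(X_test, y_test))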