Getting started
Introduction
Essentially, a Poniard estimator is a set of scikit-learn estimators, a preprocessing strategy, a cross validation strategy and one or more metrics with which to score models.
The idea behind Poniard is to abstract away some of the boilerplate involved in fitting multiple models and comparing their cross validated results. However, a significant effort is made to keep everything flexible and as close to scikit-learn as possible.
Poniard includes a PoniardClassifier and a PoniardRegressor, aligned with scikit-learn classifiers and regressors.
Basic usage
In the following example we will load a toy dataset (sklearn’s diabetes dataset, a simple regression task) and have at it with default parameters.
from poniard import PoniardRegressor
from sklearn.datasets import load_diabetes

X, y = load_diabetes(as_frame=True, return_X_y=True)
pnd = PoniardRegressor()
pnd.setup(X, y)

Setup info
Target
Type: continuous
Shape: (442,)
Unique values: 214
Metrics
Main metric: neg_mean_squared_error
Feature type inference
Minimum unique values to consider a number-like feature numeric: 44
Minimum unique values to consider a categorical feature high cardinality: 20
Inferred feature types:

| | numeric | categorical_high | categorical_low | datetime |
|---|---|---|---|---|
| 0 | age | | sex | |
| 1 | bmi | | | |
| 2 | bp | | | |
| 3 | s1 | | | |
| 4 | s2 | | | |
| 5 | s3 | | | |
| 6 | s4 | | | |
| 7 | s5 | | | |
| 8 | s6 | | | |
PoniardRegressor()
Out of the box, you will get some useful information about the target variable and the features, as well as information about the current Poniard settings (main metric and feature type inference thresholds). These are covered in detail later.
Once Poniard has parsed the data and built the preprocessing pipeline, we are free to run PoniardBaseEstimator.fit and PoniardBaseEstimator.get_results.
pnd.fit()
pnd.get_results()

| | test_neg_mean_squared_error | test_neg_mean_absolute_percentage_error | test_neg_median_absolute_error | test_r2 | fit_time | score_time |
|---|---|---|---|---|---|---|
| LinearRegression | -2977.598515 | -0.396566 | -39.009146 | 0.489155 | 0.005163 | 0.002297 |
| ElasticNet | -3159.017211 | -0.422912 | -42.619546 | 0.460740 | 0.003851 | 0.002356 |
| RandomForestRegressor | -3431.823331 | -0.419956 | -42.203000 | 0.414595 | 0.101286 | 0.004822 |
| HistGradientBoostingRegressor | -3544.069433 | -0.407417 | -40.396390 | 0.391633 | 0.279463 | 0.007446 |
| KNeighborsRegressor | -3615.195398 | -0.418674 | -38.980000 | 0.379625 | 0.003590 | 0.002260 |
| XGBRegressor | -3923.488860 | -0.426471 | -39.031309 | 0.329961 | 0.056158 | 0.002874 |
| LinearSVR | -4268.314411 | -0.374296 | -43.388592 | 0.271443 | 0.004315 | 0.002116 |
| DummyRegressor | -5934.577616 | -0.621540 | -61.775921 | -0.000797 | 0.003002 | 0.001690 |
| DecisionTreeRegressor | -6728.423034 | -0.591906 | -59.700000 | -0.145460 | 0.004800 | 0.001748 |
In those two lines, 9 different regression models were trained with cross validation and the average scores for multiple metrics were printed.
Poniard always includes a DummyClassifier with strategy="prior" or a DummyRegressor with strategy="mean" to provide absolute minimum baseline scores. Models should easily beat these, but you might be surprised.
Poniard tries to provide good defaults everywhere.
- estimators: scikit-learn provides more than 40 classifiers and 50 regressors, but for most problems you can make do with a limited list of the most battle-tested models. Poniard reflects that.
- metrics: different metrics capture different aspects of the relationship between predictions and ground truth, so Poniard includes multiple suitable ones.
- cv: cross validation is a key aspect of the Poniard flow, and by default 5-fold cross validation is used.
random_seed behavior
Poniard estimators’ random_seed parameter is always set (if random_seed=None at initialization, it will be forced to 0) and injected into models and cross validators. The idea is to get a reproducible environment, including using the same cross validation folds for each model.
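As a rough sketch of how these defaults can be overridden: the estimators, metrics and cv constructor arguments below are assumed from the descriptions above rather than taken verbatim from the API reference, so double check the exact signature.

from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold
from poniard import PoniardRegressor

# Sketch only: override the default estimators, metrics, CV strategy and seed.
pnd_custom = PoniardRegressor(
    estimators=[Ridge()],                        # assumed: a list of scikit-learn estimators
    metrics=["neg_mean_squared_error", "r2"],    # assumed: scikit-learn scorer names
    cv=KFold(n_splits=10, shuffle=True),         # assumed: any scikit-learn cross validator
    random_seed=42,                              # injected into models and cross validators
)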
Default preprocessing deserves its own mention. By default, type inference is run on the dataset's features and transformations are applied accordingly; this is handled by a PoniardPreprocessor built inside PoniardBaseEstimator. The end goal of the default preprocessor is to make models run without raising any errors.
As with most things in Poniard, the preprocessing pipeline can be modified (by passing a custom PoniardPreprocessor) or replaced entirely with the scikit-learn transformers you are used to.
pnd.preprocessor

Pipeline(steps=[('type_preprocessor',
ColumnTransformer(transformers=[('numeric_preprocessor',
Pipeline(steps=[('numeric_imputer',
SimpleImputer()),
('scaler',
StandardScaler())]),
['age', 'bmi', 'bp', 's1',
's2', 's3', 's4', 's5',
's6']),
('categorical_low_preprocessor',
Pipeline(steps=[('categorical_imputer',
SimpleImputer(strategy='most_frequent')),
('one-hot_encoder',
OneHotEncoder(drop='if_binary',
handle_unknown='ignore',
sparse=False))]),
['sex'])])),
('remove_invariant', VarianceThreshold())],
verbose=0)
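As a sketch of the replacement route mentioned above, a plain scikit-learn pipeline could be passed in place of the default preprocessor. The custom_preprocessor argument used here is an assumed name for illustration only; check the API reference for the exact parameter.

from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler
from poniard import PoniardRegressor

# Hypothetical sketch: swap the default preprocessing for a scikit-learn pipeline.
custom_prep = Pipeline(
    steps=[("imputer", SimpleImputer(strategy="median")), ("scaler", RobustScaler())]
)
pnd_custom = PoniardRegressor(custom_preprocessor=custom_prep)  # parameter name assumed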
Poniard keeps track of which models it has cross validated, which means that
- If a new one is added, it will not fit the existing ones.
- If the preprocessor is changed after training, it will fit everything again.
from sklearn.linear_model import SGDRegressor

pnd.add_estimators(SGDRegressor(max_iter=10000))
pnd.fit()
pnd.get_results()

| | test_neg_mean_squared_error | test_neg_mean_absolute_percentage_error | test_neg_median_absolute_error | test_r2 | fit_time | score_time |
|---|---|---|---|---|---|---|
| LinearRegression | -2977.598515 | -0.396566 | -39.009146 | 0.489155 | 0.005163 | 0.002297 |
| SGDRegressor | -2984.789764 | -0.396191 | -40.013179 | 0.487430 | 0.004640 | 0.001782 |
| ElasticNet | -3159.017211 | -0.422912 | -42.619546 | 0.460740 | 0.003851 | 0.002356 |
| RandomForestRegressor | -3431.823331 | -0.419956 | -42.203000 | 0.414595 | 0.101286 | 0.004822 |
| HistGradientBoostingRegressor | -3544.069433 | -0.407417 | -40.396390 | 0.391633 | 0.279463 | 0.007446 |
| KNeighborsRegressor | -3615.195398 | -0.418674 | -38.980000 | 0.379625 | 0.003590 | 0.002260 |
| XGBRegressor | -3923.488860 | -0.426471 | -39.031309 | 0.329961 | 0.056158 | 0.002874 |
| LinearSVR | -4268.314411 | -0.374296 | -43.388592 | 0.271443 | 0.004315 | 0.002116 |
| DummyRegressor | -5934.577616 | -0.621540 | -61.775921 | -0.000797 | 0.003002 | 0.001690 |
| DecisionTreeRegressor | -6728.423034 | -0.591906 | -59.700000 | -0.145460 | 0.004800 | 0.001748 |
Anywhere Poniard takes estimators or metrics (or strings representing them), either a single element or a sequence of elements can be passed and will be handled gracefully.
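For example, both of the following forms would work with add_estimators (a minimal sketch; BayesianRidge and LinearSVR are just arbitrary extra estimators).

from sklearn.linear_model import BayesianRidge
from sklearn.svm import LinearSVR

pnd.add_estimators(BayesianRidge())                 # a single estimator works
pnd.add_estimators([BayesianRidge(), LinearSVR()])  # and so does a sequence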
A quick view of an estimator is available through PoniardBaseEstimator.analyze_estimator.
pnd.analyze_estimator("SGDRegressor", height=1000, width=800)

Plots: when get_results is not enough
While a nicely formatted table is useful, plots can make comparisons a lot smoother. Poniard estimators include a plot accessor that gives access to multiple prebuilt plots.
Check the plotting reference for a deeper dive.
fig = pnd.plot.metrics(
kind="bar", metrics=["neg_mean_absolute_percentage_error", "neg_mean_squared_error"]
)
fig.show("notebook")

fig = pnd.plot.residuals_histogram(estimator_names=["LinearRegression", "SGDRegressor"])
fig.show("notebook")

By using Plotly, users can have an easier time exploring charts, zooming in, selecting specific models, etc.
A reasonably unified API
So far we have analyzed a regression task. Luckily, PoniardClassifier and PoniardRegressor differ only in default models, default cross validation strategy and default metrics.
from poniard import PoniardClassifier
from sklearn.datasets import load_wine

X, y = load_wine(return_X_y=True, as_frame=True)
clf = PoniardClassifier().setup(X, y, show_info=False)
clf.fit()
clf.get_results()

| | test_roc_auc_ovr | test_accuracy | test_precision_macro | test_recall_macro | test_f1_macro | fit_time | score_time |
|---|---|---|---|---|---|---|---|
| LogisticRegression | 1.000000 | 0.983175 | 0.982828 | 0.983810 | 0.982571 | 0.004461 | 0.003472 |
| RandomForestClassifier | 0.999336 | 0.971905 | 0.973216 | 0.974098 | 0.972726 | 0.044476 | 0.007773 |
| HistGradientBoostingClassifier | 0.999311 | 0.971746 | 0.970350 | 0.976508 | 0.972109 | 0.268834 | 0.025884 |
| SVC | 0.999128 | 0.960635 | 0.960133 | 0.965079 | 0.960681 | 0.002355 | 0.002526 |
| GaussianNB | 0.998855 | 0.971905 | 0.973533 | 0.974098 | 0.972720 | 0.002148 | 0.002939 |
| XGBClassifier | 0.998213 | 0.949206 | 0.956410 | 0.950548 | 0.950512 | 0.020969 | 0.003620 |
| KNeighborsClassifier | 0.995903 | 0.960794 | 0.959845 | 0.965079 | 0.960468 | 0.001462 | 0.003054 |
| DecisionTreeClassifier | 0.945058 | 0.927302 | 0.933483 | 0.927961 | 0.928931 | 0.001737 | 0.002261 |
| DummyClassifier | 0.500000 | 0.399048 | 0.133016 | 0.333333 | 0.190095 | 0.001500 | 0.002517 |
fig = clf.plot.confusion_matrix(estimator_name="LogisticRegression")
fig.show("notebook")

Intended use
In the real world where real data lives, building machine learning models is not as simple as running an abstraction in two lines and calling it a day.
Our preferred way of working is splitting the data into train and test sets before touching Poniard. That way you can pass the training data to PoniardBaseEstimator.setup and let the built-in cross validation handle model evaluation.
When you are done, PoniardBaseEstimator.get_estimator simply returns a pipeline by name and optionally retrains it on the full data passed to setup (which should be just the training split), making it easy to keep working on models while preserving an unseen test set that can then be used to assess generalization.
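A minimal sketch of that workflow, assuming retrain=True refits the returned pipeline on the data passed to setup:

from sklearn.datasets import load_wine
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from poniard import PoniardClassifier

X, y = load_wine(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = PoniardClassifier().setup(X_train, y_train, show_info=False)
clf.fit()
clf.get_results()  # cross validated scores computed on the training split only

# Retrieve the chosen pipeline, refit it on the training split, and only now touch the test set.
model = clf.get_estimator("LogisticRegression", retrain=True)
print(accuracy_score(y_test, model.predict(X_test)))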
clf.get_estimator("RandomForestClassifier", retrain=False)

Pipeline(steps=[('preprocessor',
Pipeline(steps=[('type_preprocessor',
Pipeline(steps=[('numeric_imputer',
SimpleImputer()),
('scaler',
StandardScaler())])),
('remove_invariant', VarianceThreshold())],
verbose=0)),
('RandomForestClassifier',
RandomForestClassifier(random_state=0))])