Main parameters

Learn how the main Poniard parameters work.

Introduction

At the core of Poniard lies the choice of estimators, metrics and cross-validation strategy. While the defaults should work for most cases, we try to keep things flexible.

estimators

Estimators can be passed as a dict of estimator_name: estimator_instance, as a sequence of estimator instances, or as a single estimator. When names are not specified, they are derived from each estimator's class name.

Using a dictionary allows passing multiple instances of the same estimator with different parameters.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

from poniard import PoniardClassifier
X, y = make_classification(n_classes=3, n_features=5, n_informative=3)
pnd = PoniardClassifier(
    estimators={
        "lr": LogisticRegression(max_iter=5000),
        "lr_no_penalty": LogisticRegression(max_iter=5000, penalty="none"),
        "lda": LinearDiscriminantAnalysis(),
    }
)
pnd.setup(X, y)
pnd.fit()

Setup info

Target

Type: multiclass

Shape: (100,)

Unique values: 3

Metrics

Main metric: roc_auc_ovr

Feature type inference

Minimum unique values to consider a number-like feature numeric: 10

Minimum unique values to consider a categorical feature high cardinality: 20

Inferred feature types:

   numeric  categorical_high  categorical_low  datetime
0      0.0
1      1.0
2      2.0
3      3.0
4      4.0
PoniardClassifier(estimators={'lr': LogisticRegression(max_iter=5000, random_state=0), 'lr_no_penalty': LogisticRegression(max_iter=5000, penalty='none', random_state=0), 'lda': LinearDiscriminantAnalysis()})

Since we are in scikit-learn-land, most of the stuff you expect to work still works. For example, multilabel classification.

Here we had to use a dictionary because estimator.__class__.__name__, which is used to assign a name to each estimator when a sequence is passed, would be "OneVsRestClassifier" for both estimators, so one would overwrite the other.

from sklearn.datasets import make_multilabel_classification
from sklearn.multiclass import OneVsRestClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
X, y = make_multilabel_classification(n_samples=1000, n_features=6)
pnd = PoniardClassifier(
    estimators={
        "rf": OneVsRestClassifier(RandomForestClassifier()),
        "lr": OneVsRestClassifier(LogisticRegression(max_iter=5000)),
    }
)
pnd.setup(X, y, show_info=False)
pnd.fit()
/Users/rafxavier/Documents/Repos/personal/poniard/poniard/preprocessing/core.py:145: UserWarning: TargetEncoder is not supported for multilabel or multioutput targets. Switching to OrdinalEncoder.
  ) = self._setup_transformers()
PoniardClassifier(estimators={'rf': OneVsRestClassifier(estimator=RandomForestClassifier()), 'lr': OneVsRestClassifier(estimator=LogisticRegression(max_iter=5000))})
pnd.get_results()
test_roc_auc test_accuracy test_precision_macro test_recall_macro test_f1_macro fit_time score_time
rf 0.841913 0.404 0.718702 0.592022 0.635481 0.568812 0.048382
lr 0.801743 0.320 0.651472 0.532183 0.570917 0.142863 0.009468
DummyClassifier 0.500000 0.093 0.194000 0.360000 0.251713 0.004363 0.006814
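To see the name collision mentioned above, here is a minimal sketch of the class-name derivation using plain scikit-learn (the dict comprehension is an illustration of the behavior, not Poniard's actual code):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

estimators = [
    OneVsRestClassifier(RandomForestClassifier()),
    OneVsRestClassifier(LogisticRegression(max_iter=5000)),
]
# Both wrappers share the same class name, so dict keys collide
names = {est.__class__.__name__: est for est in estimators}
print(list(names))  # ['OneVsRestClassifier'] — only one entry survives
```

Explicit keys like "rf" and "lr" sidestep the collision entirely.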

metrics

Metrics can be passed as a list of strings such as "accuracy" or "neg_mean_squared_error", following the familiar scikit-learn nomenclature, or as a dict of str: Callable. For convenience, it can also be a single string.

However, in a departure from scikit-learn, passing a Callable directly as a metric will raise an error. This restriction is in place to make naming columns in the PoniardBaseEstimator.get_results method unambiguous.

scoring vs. metrics parameters

In scikit-learn parlance, a metric is a measure of the prediction error of a model. Scoring, which is used by scikit-learn model evaluation objects (like GridSearchCV or cross_validate), means the same thing, but with the restriction that higher values must be better than lower values. Poniard uses the parameter name metrics for now, but will eventually migrate to scoring, as that more accurately reflects its meaning.
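The "higher is better" convention is why scikit-learn exposes error metrics as negated scorers; a quick illustration using plain scikit-learn:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=200, n_features=5, noise=10, random_state=0)
# Error metrics are negated so that higher values are better,
# hence the "neg_" prefix in scorer names like "neg_mean_squared_error"
scores = cross_val_score(
    LinearRegression(), X, y, scoring="neg_mean_squared_error", cv=3
)
print(all(s < 0 for s in scores))  # MSE is positive, so its negation is below zero
```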

from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from poniard import PoniardRegressor
X, y = make_regression(n_samples=500, n_features=10, n_informative=5)
pnd = PoniardRegressor(
    metrics=["explained_variance", "neg_median_absolute_error"],
    estimators=[LinearRegression()],
)
pnd.setup(X, y, show_info=False)
pnd.fit()
pnd.get_results(return_train_scores=True)
test_explained_variance train_explained_variance test_neg_median_absolute_error train_neg_median_absolute_error fit_time score_time
LinearRegression 1.0 1.000000e+00 -8.366641e-14 -8.038015e-14 0.001400 0.000375
DummyRegressor 0.0 4.440892e-17 -6.716617e+01 -6.567442e+01 0.000645 0.000255
from sklearn.metrics import r2_score, make_scorer
from sklearn.linear_model import Ridge
def scaled_r2(y_true, y_pred):
    return round(r2_score(y_true, y_pred) * 100, 1)


pnd = PoniardRegressor(
    metrics={
        "scaled_r2": make_scorer(scaled_r2, greater_is_better=True),
        "usual_r2": make_scorer(r2_score, greater_is_better=True),
    },
    estimators=[LinearRegression(), Ridge()],
)
pnd.setup(X, y, show_info=False).fit().get_results()
test_scaled_r2 test_usual_r2 fit_time score_time
LinearRegression 100.00 1.000000 0.001491 0.000342
Ridge 100.00 0.999994 0.001253 0.000301
DummyRegressor -0.88 -0.008754 0.000607 0.000254

The order in which scorers are passed matters: the first scorer is treated as the main metric and is used by some methods (such as plots) when no metric is explicitly specified.

print(pnd.metrics)
fig = pnd.plot.permutation_importance("Ridge")
fig.show("notebook")
{'scaled_r2': make_scorer(scaled_r2), 'usual_r2': make_scorer(r2_score)}

cv

Cross validation can be anything that scikit-learn accepts. By default, classification tasks will be paired with a StratifiedKFold if the target is binary, and KFold otherwise. Regression tasks use KFold by default.

Integer and None values of cv are internally converted to one of the above classes so that Poniard's random_state parameter can be passed on.

from sklearn.model_selection import RepeatedKFold
pnd_4 = PoniardRegressor(cv=4).setup(X, y, show_info=False)
pnd_none = PoniardRegressor(cv=None).setup(X, y, show_info=False)
pnd_k = PoniardRegressor(cv=RepeatedKFold(n_splits=3)).setup(X, y, show_info=False)

print(pnd_4.cv, pnd_none.cv, pnd_k.cv, sep="\n")
KFold(n_splits=4, random_state=0, shuffle=True)
KFold(n_splits=5, random_state=0, shuffle=True)
RepeatedKFold(n_repeats=10, n_splits=3, random_state=0)

Note that even though we didn't specify random_state for the RepeatedKFold in the third instance, it gets injected during setup.
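A rough sketch of what that injection looks like (the inject_random_state helper is hypothetical and only mirrors the behavior; it is not Poniard's actual implementation):

```python
from sklearn.model_selection import RepeatedKFold

def inject_random_state(cv, random_state=0):
    # Hypothetical helper: fill in random_state only when the user left it unset
    if getattr(cv, "random_state", None) is None:
        cv.random_state = random_state
    return cv

cv = inject_random_state(RepeatedKFold(n_splits=3))
print(cv.random_state)  # 0 — Poniard's default random_state
```

Because the splitter's random_state is a plain attribute on scikit-learn CV objects, setting it after construction is enough to make splits reproducible.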