from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from poniard import PoniardClassifier
Main parameters
Introduction
At the core of Poniard lies the choice of estimators, metrics and CV strategy. While the defaults should work for most cases, we try to keep things flexible.
estimators
Estimators can be passed as a dict of estimator_name: estimator_instance, as a sequence of estimator_instance, or as a single estimator. When names are not specified, they are obtained directly from each estimator's class.
Using a dictionary allows passing multiple instances of the same estimator with different parameters.
X, y = make_classification(n_classes=3, n_features=5, n_informative=3)
pnd = PoniardClassifier(
    estimators={
        "lr": LogisticRegression(max_iter=5000),
        "lr_no_penalty": LogisticRegression(max_iter=5000, penalty="none"),
        "lda": LinearDiscriminantAnalysis(),
    }
)
pnd.setup(X, y)
pnd.fit()
Setup info
Target
Type: multiclass
Shape: (100,)
Unique values: 3
Metrics
Main metric: roc_auc_ovr
Feature type inference
Minimum unique values to consider a number-like feature numeric: 10
Minimum unique values to consider a categorical feature high cardinality: 20
Inferred feature types:
| | numeric | categorical_high | categorical_low | datetime |
|---|---|---|---|---|
| 0 | 0.0 | | | |
| 1 | 1.0 | | | |
| 2 | 2.0 | | | |
| 3 | 3.0 | | | |
| 4 | 4.0 | | | |
PoniardClassifier(estimators={'lr': LogisticRegression(max_iter=5000, random_state=0), 'lr_no_penalty': LogisticRegression(max_iter=5000, penalty='none', random_state=0), 'lda': LinearDiscriminantAnalysis()})
Since we are in scikit-learn-land, most of the stuff you expect to work still works. For example, multilabel classification.
Here we have to use a dictionary: estimator.__class__.__name__, which is used to assign a name to each estimator when a list is passed, would be OneVsRestClassifier for both, so one entry would overwrite the other.
from sklearn.datasets import make_multilabel_classification
from sklearn.multiclass import OneVsRestClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
X, y = make_multilabel_classification(n_samples=1000, n_features=6)
pnd = PoniardClassifier(
    estimators={
        "rf": OneVsRestClassifier(RandomForestClassifier()),
        "lr": OneVsRestClassifier(LogisticRegression(max_iter=5000)),
    }
)
pnd.setup(X, y, show_info=False)
pnd.fit()
/Users/rafxavier/Documents/Repos/personal/poniard/poniard/preprocessing/core.py:145: UserWarning: TargetEncoder is not supported for multilabel or multioutput targets. Switching to OrdinalEncoder.
  ) = self._setup_transformers()
PoniardClassifier(estimators={'rf': OneVsRestClassifier(estimator=RandomForestClassifier()), 'lr': OneVsRestClassifier(estimator=LogisticRegression(max_iter=5000))})
pnd.get_results()
| | test_roc_auc | test_accuracy | test_precision_macro | test_recall_macro | test_f1_macro | fit_time | score_time |
|---|---|---|---|---|---|---|---|
| rf | 0.841913 | 0.404 | 0.718702 | 0.592022 | 0.635481 | 0.568812 | 0.048382 |
| lr | 0.801743 | 0.320 | 0.651472 | 0.532183 | 0.570917 | 0.142863 | 0.009468 |
| DummyClassifier | 0.500000 | 0.093 | 0.194000 | 0.360000 | 0.251713 | 0.004363 | 0.006814 |
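To make the naming collision concrete, here is a minimal sketch of the idea (using a stand-in wrapper class, not Poniard's actual internals): when names are derived from __class__.__name__, two wrappers of the same class collapse into a single dict key.

```python
# Stand-in wrapper class for illustration only; not Poniard's actual code.
class OneVsRestClassifier:
    def __init__(self, estimator):
        self.estimator = estimator

rf = OneVsRestClassifier("RandomForestClassifier")
lr = OneVsRestClassifier("LogisticRegression")

# Deriving names from the class collapses both wrappers into one key;
# only the last entry survives.
named = {est.__class__.__name__: est for est in [rf, lr]}
print(list(named))  # ['OneVsRestClassifier']
```

Passing an explicit dict sidesteps this entirely, since the keys are chosen by the user.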
metrics
Metrics can be passed as a list of strings such as "accuracy" or "neg_mean_squared_error", following the familiar scikit-learn nomenclature, or as a dict of str: Callable. For convenience, it can also be a single string.
However, in a departure from scikit-learn, passing a bare Callable directly will fail. This restriction is in place to make naming the columns in the PoniardBaseEstimator.get_results method straightforward.
scoring vs. metrics parameters
In scikit-learn parlance, a metric is a measure of the prediction error of a model. Scoring, which is used in scikit-learn model evaluation objects (like GridSearchCV or cross_validate), reflects the same idea, but with the restriction that higher values must be better than lower values. Poniard uses the parameter name metrics for now, but will eventually migrate to scoring, as that more accurately reflects its meaning.
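The "higher is better" convention can be sketched in a few lines (a hypothetical illustration, not scikit-learn's implementation): an error metric is negated so that a smaller error yields a higher score, which is exactly what scikit-learn's neg_* scorer names signal.

```python
# Plain error metric: lower is better.
def mean_absolute_error(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Scoring version: the sign flip makes higher values better.
def neg_mean_absolute_error(y_true, y_pred):
    return -mean_absolute_error(y_true, y_pred)

print(neg_mean_absolute_error([1.0, 2.0], [1.0, 4.0]))  # -1.0
```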
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from poniard import PoniardRegressor
X, y = make_regression(n_samples=500, n_features=10, n_informative=5)
pnd = PoniardRegressor(
    metrics=["explained_variance", "neg_median_absolute_error"],
    estimators=[LinearRegression()],
)
pnd.setup(X, y, show_info=False)
pnd.fit()
pnd.get_results(return_train_scores=True)
| | test_explained_variance | train_explained_variance | test_neg_median_absolute_error | train_neg_median_absolute_error | fit_time | score_time |
|---|---|---|---|---|---|---|
| LinearRegression | 1.0 | 1.000000e+00 | -8.366641e-14 | -8.038015e-14 | 0.001400 | 0.000375 |
| DummyRegressor | 0.0 | 4.440892e-17 | -6.716617e+01 | -6.567442e+01 | 0.000645 | 0.000255 |
from sklearn.metrics import r2_score, make_scorer
from sklearn.linear_model import Ridge

def scaled_r2(y_true, y_pred):
    return round(r2_score(y_true, y_pred) * 100, 1)
pnd = PoniardRegressor(
    metrics={
        "scaled_r2": make_scorer(scaled_r2, greater_is_better=True),
        "usual_r2": make_scorer(r2_score, greater_is_better=True),
    },
    estimators=[LinearRegression(), Ridge()],
)
pnd.setup(X, y, show_info=False).fit().get_results()
| | test_scaled_r2 | test_usual_r2 | fit_time | score_time |
|---|---|---|---|---|
| LinearRegression | 100.00 | 1.000000 | 0.001491 | 0.000342 |
| Ridge | 100.00 | 0.999994 | 0.001253 | 0.000301 |
| DummyRegressor | -0.88 | -0.008754 | 0.000607 | 0.000254 |
The order in which metrics are passed matters: the first one is treated as the main metric and will be used by some methods when no other metric is specified.
print(pnd.metrics)
fig = pnd.plot.permutation_importance("Ridge")
fig.show("notebook")
{'scaled_r2': make_scorer(scaled_r2), 'usual_r2': make_scorer(r2_score)}
cv
Cross validation can be anything that scikit-learn accepts. By default, classification tasks will be paired with a StratifiedKFold if the target is binary, and KFold otherwise. Regression tasks use KFold by default.
cv=int or cv=None are internally converted to one of the above classes so that Poniard’s random_state parameter can be passed on.
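That conversion can be pictured roughly like this (a hypothetical sketch with a stand-in splitter class, not Poniard's actual code): ints and None become a shuffled splitter that carries the shared random_state, while existing splitters pass through untouched.

```python
# Stand-in with the same constructor signature as scikit-learn's KFold.
class KFold:
    def __init__(self, n_splits=5, shuffle=False, random_state=None):
        self.n_splits = n_splits
        self.shuffle = shuffle
        self.random_state = random_state

    def __repr__(self):
        return (f"KFold(n_splits={self.n_splits}, "
                f"random_state={self.random_state}, shuffle={self.shuffle})")

def normalize_cv(cv, random_state=0):
    """Turn cv=None or cv=int into a splitter that carries random_state."""
    if cv is None:
        cv = 5
    if isinstance(cv, int):
        return KFold(n_splits=cv, shuffle=True, random_state=random_state)
    return cv  # an existing splitter is passed through unchanged

print(normalize_cv(4))     # KFold(n_splits=4, random_state=0, shuffle=True)
print(normalize_cv(None))  # KFold(n_splits=5, random_state=0, shuffle=True)
```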
from sklearn.model_selection import RepeatedKFold
pnd_5 = PoniardRegressor(cv=4).setup(X, y, show_info=False)
pnd_none = PoniardRegressor(cv=None).setup(X, y, show_info=False)
pnd_k = PoniardRegressor(cv=RepeatedKFold(n_splits=3)).setup(X, y, show_info=False)
print(pnd_5.cv, pnd_none.cv, pnd_k.cv, sep="\n")
KFold(n_splits=4, random_state=0, shuffle=True)
KFold(n_splits=5, random_state=0, shuffle=True)
RepeatedKFold(n_repeats=10, n_splits=3, random_state=0)
Note that even though we didn’t specify random_state for the third estimator, it gets injected during setup.