Main parameters
Introduction
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from poniard import PoniardClassifier
At the core of Poniard lie the choice of estimators, metrics and cross-validation strategy. While the defaults should work for most cases, we try to keep all three flexible.
estimators
Estimators can be passed as a dict of estimator_name: estimator_instance, as a sequence of estimator_instance, or as a single estimator. When names are not specified, they are derived directly from the estimator's class. Using a dictionary allows passing multiple instances of the same estimator with different parameters.
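The three accepted forms can be pictured with a small normalization sketch. Note that `normalize_estimators` and the empty classes below are illustrative stand-ins, not Poniard's actual internals:

```python
class LogisticRegression:  # stand-in for the scikit-learn estimator
    pass

class LinearDiscriminantAnalysis:  # stand-in
    pass

def normalize_estimators(estimators):
    """Hypothetical sketch: coerce dict / sequence / single estimator
    into one dict of name -> instance."""
    if isinstance(estimators, dict):
        return estimators
    if isinstance(estimators, (list, tuple)):
        # Names fall back to the class name, as described above.
        return {est.__class__.__name__: est for est in estimators}
    return {estimators.__class__.__name__: estimators}

normalize_estimators([LogisticRegression(), LinearDiscriminantAnalysis()])
# keys: "LogisticRegression", "LinearDiscriminantAnalysis"
```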
X, y = make_classification(n_classes=3, n_features=5, n_informative=3)
pnd = PoniardClassifier(
    estimators={
        "lr": LogisticRegression(max_iter=5000),
        "lr_no_penalty": LogisticRegression(max_iter=5000, penalty="none"),
        "lda": LinearDiscriminantAnalysis(),
    }
)
pnd.setup(X, y)
pnd.fit()
Setup info
Target
Type: multiclass
Shape: (100,)
Unique values: 3
Metrics
Main metric: roc_auc_ovr
Feature type inference
Minimum unique values to consider a number-like feature numeric: 10
Minimum unique values to consider a categorical feature high cardinality: 20
Inferred feature types:
| | numeric | categorical_high | categorical_low | datetime |
|---|---|---|---|---|
| 0 | 0.0 | | | |
| 1 | 1.0 | | | |
| 2 | 2.0 | | | |
| 3 | 3.0 | | | |
| 4 | 4.0 | | | |
PoniardClassifier(estimators={'lr': LogisticRegression(max_iter=5000, random_state=0), 'lr_no_penalty': LogisticRegression(max_iter=5000, penalty='none', random_state=0), 'lda': LinearDiscriminantAnalysis()})
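The two thresholds reported in the setup info can be pictured with a sketch of the heuristic. The function name and exact rule below are illustrative assumptions; Poniard's real inference lives in its preprocessing module:

```python
def infer_feature_type(n_unique, numeric_dtype,
                       numeric_threshold=10, cardinality_threshold=20):
    """Hypothetical sketch: number-like columns need at least
    `numeric_threshold` unique values to count as numeric; categoricals
    with `cardinality_threshold` or more unique values are flagged as
    high cardinality."""
    if numeric_dtype and n_unique >= numeric_threshold:
        return "numeric"
    if n_unique >= cardinality_threshold:
        return "categorical_high"
    return "categorical_low"

infer_feature_type(80, numeric_dtype=True)   # "numeric"
infer_feature_type(3, numeric_dtype=True)    # too few values -> "categorical_low"
infer_feature_type(30, numeric_dtype=False)  # "categorical_high"
```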
Since we are in scikit-learn-land, most of the stuff you expect to work still works; for example, multilabel classification. In the following example we have to use a dictionary because estimator.__class__.__name__, which is used to assign a name to each estimator when a list is passed, would be the same for both OneVsRestClassifier instances, and the second would overwrite the first.
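The collision is easy to reproduce with plain Python (the class below is an empty stand-in, not the real scikit-learn wrapper):

```python
class OneVsRestClassifier:  # stand-in for sklearn.multiclass.OneVsRestClassifier
    pass

wrapped = [OneVsRestClassifier(), OneVsRestClassifier()]
# Deriving names from the class collapses both entries into one dict key.
named = {est.__class__.__name__: est for est in wrapped}
len(named)  # 1 -- the second wrapper silently replaced the first
```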
from sklearn.datasets import make_multilabel_classification
from sklearn.multiclass import OneVsRestClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
X, y = make_multilabel_classification(n_samples=1000, n_features=6)
pnd = PoniardClassifier(
    estimators={
        "rf": OneVsRestClassifier(RandomForestClassifier()),
        "lr": OneVsRestClassifier(LogisticRegression(max_iter=5000)),
    }
)
pnd.setup(X, y, show_info=False)
pnd.fit()
/Users/rafxavier/Documents/Repos/personal/poniard/poniard/preprocessing/core.py:145: UserWarning: TargetEncoder is not supported for multilabel or multioutput targets. Switching to OrdinalEncoder.
PoniardClassifier(estimators={'rf': OneVsRestClassifier(estimator=RandomForestClassifier()), 'lr': OneVsRestClassifier(estimator=LogisticRegression(max_iter=5000))})
pnd.get_results()
| | test_roc_auc | test_accuracy | test_precision_macro | test_recall_macro | test_f1_macro | fit_time | score_time |
|---|---|---|---|---|---|---|---|
| rf | 0.841913 | 0.404 | 0.718702 | 0.592022 | 0.635481 | 0.568812 | 0.048382 |
| lr | 0.801743 | 0.320 | 0.651472 | 0.532183 | 0.570917 | 0.142863 | 0.009468 |
| DummyClassifier | 0.500000 | 0.093 | 0.194000 | 0.360000 | 0.251713 | 0.004363 | 0.006814 |
metrics
Metrics can be passed as a list of strings such as "accuracy" or "neg_mean_squared_error", following the familiar scikit-learn nomenclature, or as a dict of str: Callable. For convenience, a single string is also accepted.
However, in a departure from scikit-learn, metrics will fail if a Callable is passed directly. This restriction is in place to facilitate naming columns in the PoniardBaseEstimator.get_results method.
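Why names are required can be pictured with a sketch of how result columns might be labeled. `result_columns` is a hypothetical helper, not Poniard's API; it only illustrates that a bare Callable carries no usable column name:

```python
def result_columns(metrics, return_train_scores=False):
    """Hypothetical sketch: build get_results-style column labels
    from metric names."""
    cols = [f"test_{name}" for name in metrics]
    if return_train_scores:
        cols += [f"train_{name}" for name in metrics]
    return cols + ["fit_time", "score_time"]

result_columns(["explained_variance", "neg_median_absolute_error"])
# ['test_explained_variance', 'test_neg_median_absolute_error',
#  'fit_time', 'score_time']
```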
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from poniard import PoniardRegressor
X, y = make_regression(n_samples=500, n_features=10, n_informative=5)
pnd = PoniardRegressor(
    metrics=["explained_variance", "neg_median_absolute_error"],
    estimators=[LinearRegression()],
)
pnd.setup(X, y, show_info=False)
pnd.fit()
pnd.get_results(return_train_scores=True)
| | test_explained_variance | train_explained_variance | test_neg_median_absolute_error | train_neg_median_absolute_error | fit_time | score_time |
|---|---|---|---|---|---|---|
| LinearRegression | 1.0 | 1.000000e+00 | -8.366641e-14 | -8.038015e-14 | 0.001400 | 0.000375 |
| DummyRegressor | 0.0 | 4.440892e-17 | -6.716617e+01 | -6.567442e+01 | 0.000645 | 0.000255 |
from sklearn.metrics import r2_score, make_scorer
from sklearn.linear_model import Ridge
def scaled_r2(y_true, y_pred):
    return round(r2_score(y_true, y_pred) * 100, 1)

pnd = PoniardRegressor(
    metrics={
        "scaled_r2": make_scorer(scaled_r2, greater_is_better=True),
        "usual_r2": make_scorer(r2_score, greater_is_better=True),
    },
    estimators=[LinearRegression(), Ridge()],
)
pnd.setup(X, y, show_info=False).fit().get_results()
| | test_scaled_r2 | test_usual_r2 | fit_time | score_time |
|---|---|---|---|---|
| LinearRegression | 100.00 | 1.000000 | 0.001491 | 0.000342 |
| Ridge | 100.00 | 0.999994 | 0.001253 | 0.000301 |
| DummyRegressor | -0.88 | -0.008754 | 0.000607 | 0.000254 |
The order in which scorers are passed matters: the first scorer is treated as the main metric and is used by some methods when no other metric is specified.
print(pnd.metrics)
fig = pnd.plot.permutation_importance("Ridge")
fig.show("notebook")
{'scaled_r2': make_scorer(scaled_r2), 'usual_r2': make_scorer(r2_score)}
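Since Python dicts preserve insertion order, picking the main metric can be as simple as taking the first key. This is an illustrative sketch of the rule, not Poniard's actual selection code:

```python
metrics = {"scaled_r2": "scorer_a", "usual_r2": "scorer_b"}  # stand-in scorers
# The first key of the metrics dict doubles as the main metric.
main_metric = next(iter(metrics))
main_metric  # "scaled_r2"
```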
cv
Cross validation can be anything that scikit-learn accepts. By default, classification tasks are paired with a StratifiedKFold if the target is binary, and a KFold otherwise. Regression tasks use KFold by default.
cv=int or cv=None are internally converted to one of the above classes so that Poniard's random_state parameter can be passed on.
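The conversion can be sketched roughly as follows. `resolve_cv` and the dataclass are assumptions for illustration (the stand-in avoids importing scikit-learn here), not Poniard's real implementation:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class KFold:  # stand-in for sklearn.model_selection.KFold
    n_splits: int = 5
    shuffle: bool = True
    random_state: Optional[int] = None

def resolve_cv(cv, random_state=0):
    """Hypothetical sketch: ints and None become a concrete splitter so
    that random_state can be forwarded; existing splitter instances pass
    through unchanged in this sketch."""
    if cv is None:
        return KFold(random_state=random_state)
    if isinstance(cv, int):
        return KFold(n_splits=cv, random_state=random_state)
    return cv
```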
from sklearn.model_selection import RepeatedKFold
pnd_5 = PoniardRegressor(cv=4).setup(X, y, show_info=False)
pnd_none = PoniardRegressor(cv=None).setup(X, y, show_info=False)
pnd_k = PoniardRegressor(cv=RepeatedKFold(n_splits=3)).setup(X, y, show_info=False)
print(pnd_5.cv, pnd_none.cv, pnd_k.cv, sep="\n")
KFold(n_splits=4, random_state=0, shuffle=True)
KFold(n_splits=5, random_state=0, shuffle=True)
RepeatedKFold(n_repeats=10, n_splits=3, random_state=0)
Note that even though we didn't specify random_state for the third cv splitter, it gets injected during setup.
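That injection into an already-constructed splitter can be sketched like this. The stand-in class and the mutation shown are assumptions about the mechanism, not Poniard's verbatim code:

```python
class RepeatedKFold:  # stand-in for sklearn.model_selection.RepeatedKFold
    def __init__(self, n_splits=5, n_repeats=10, random_state=None):
        self.n_splits = n_splits
        self.n_repeats = n_repeats
        self.random_state = random_state

cv = RepeatedKFold(n_splits=3)
# Sketch: if the splitter exposes a random_state that was left as None,
# setup can fill it in with Poniard's own random_state.
if getattr(cv, "random_state", None) is None:
    cv.random_state = 0
cv.random_state  # 0
```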