A slightly more complex case study

Let’s try Poniard on the Adult Census dataset!
import numpy as np
import pandas as pd
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

from poniard import PoniardClassifier
from poniard.plugins import WandBPlugin, PandasProfilingPlugin
from poniard.error_analysis import ErrorAnalyzer

We’ll get the data and sample 10.000 observations to speed up training. Also, we’re going to cast pd.Categorical columns to object since scikit-learn doesn’t play will with them, drop the “fnlwgt” column as it refers to survey weights that should not provide any relevant information, and cast the target to 1-0.

# Adult Census dataset
X, y = fetch_openml(data_id=1590, return_X_y=True, as_frame=True)

X = X.sample(n=10000, random_state=0).drop("fnlwgt", axis=1)
y = y.reindex(X.index)

category_cols = X.select_dtypes(include="category").columns
X = X.astype({col: object for col in category_cols})

y = y.replace({">50K": 1, "<=50K": 0})

Next we split and pass only the training data to Poniard.

We’ll be using the 2 available plugins.

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y)

pnd = PoniardClassifier(
    n_jobs=-1,
    plugins=[WandBPlugin(project="adult-demo"), PandasProfilingPlugin()],
)
pnd.setup(X_train, y_train)
pnd.remove_estimators("SVC")  # Doesn't scale nicely
wandb: Currently logged in as: rxavier. Use `wandb login --relogin` to force relogin
wandb version 0.13.5 is available! To upgrade, please run: $ pip install wandb --upgrade
Tracking run with wandb version 0.13.3
Run data is saved locally in /Users/rafxavier/Documents/Repos/personal/poniard/nbs/guide/wandb/run-20221203_231425-1n5yya80

Setup info

Target

Type: binary

Shape: (8000,)

Unique values: 2

Metrics

Main metric: roc_auc

Feature type inference

Minimum unique values to consider a number-like feature numeric: 800

Minimum unique values to consider a categorical feature high cardinality: 20

Inferred feature types:

numeric categorical_high categorical_low datetime
0 age education-num
1 capital-gain workclass
2 capital-loss education
3 hours-per-week marital-status
4 native-country occupation
5 relationship
6 race
7 sex
PoniardClassifier(n_jobs=-1, plugins=[WandBPlugin(project='adult-demo'), PandasProfilingPlugin()])

As can be seen, Pandas Profiling already created a report and saved it to the default location. If ipywidgets is installed, the report will be included in the output.

Meanwhile, Weights and Biases either logged in or prompted for a login, and started logging information about the run (preprocessor HTML representation, dataset, inferred types). Also, because plugins can check whether other plugins are included in a Poniard estimator (by using BasePlugin._check_plugin_used), wandb also uploaded the profile report.

Right away there’s some misclassified features, so we’ll reassign them.

pnd.reassign_types(numeric=["age", "capital-gain", "capital-loss", "hours-per-week"])

Assigned feature types:

numeric categorical_high categorical_low datetime
0 age native-country education-num
1 capital-gain workclass
2 capital-loss education
3 hours-per-week marital-status
4 occupation
5 relationship
6 race
7 sex
PoniardClassifier(n_jobs=-1, plugins=[WandBPlugin(project='adult-demo'), PandasProfilingPlugin()])
pnd.fit()
pnd.get_results()
wandb: WARNING A graphql request initiated by the public wandb API timed out (timeout=9 sec). Create a new API with an integer timeout larger than 9, e.g., `api = wandb.Api(timeout=19)` to increase the graphql timeout.
wandb: WARNING A graphql request initiated by the public wandb API timed out (timeout=9 sec). Create a new API with an integer timeout larger than 9, e.g., `api = wandb.Api(timeout=19)` to increase the graphql timeout.
wandb: WARNING A graphql request initiated by the public wandb API timed out (timeout=9 sec). Create a new API with an integer timeout larger than 9, e.g., `api = wandb.Api(timeout=19)` to increase the graphql timeout.
wandb: WARNING A graphql request initiated by the public wandb API timed out (timeout=9 sec). Create a new API with an integer timeout larger than 9, e.g., `api = wandb.Api(timeout=19)` to increase the graphql timeout.
wandb: WARNING A graphql request initiated by the public wandb API timed out (timeout=9 sec). Create a new API with an integer timeout larger than 9, e.g., `api = wandb.Api(timeout=19)` to increase the graphql timeout.
wandb: WARNING A graphql request initiated by the public wandb API timed out (timeout=9 sec). Create a new API with an integer timeout larger than 9, e.g., `api = wandb.Api(timeout=19)` to increase the graphql timeout.
wandb: WARNING A graphql request initiated by the public wandb API timed out (timeout=9 sec). Create a new API with an integer timeout larger than 9, e.g., `api = wandb.Api(timeout=19)` to increase the graphql timeout.
wandb: WARNING A graphql request initiated by the public wandb API timed out (timeout=9 sec). Create a new API with an integer timeout larger than 9, e.g., `api = wandb.Api(timeout=19)` to increase the graphql timeout.
wandb: WARNING A graphql request initiated by the public wandb API timed out (timeout=9 sec). Create a new API with an integer timeout larger than 9, e.g., `api = wandb.Api(timeout=19)` to increase the graphql timeout.
wandb: WARNING A graphql request initiated by the public wandb API timed out (timeout=9 sec). Create a new API with an integer timeout larger than 9, e.g., `api = wandb.Api(timeout=19)` to increase the graphql timeout.
wandb: WARNING A graphql request initiated by the public wandb API timed out (timeout=9 sec). Create a new API with an integer timeout larger than 9, e.g., `api = wandb.Api(timeout=19)` to increase the graphql timeout.
wandb: Network error (ReadTimeout), entering retry loop.
wandb: WARNING A graphql request initiated by the public wandb API timed out (timeout=9 sec). Create a new API with an integer timeout larger than 9, e.g., `api = wandb.Api(timeout=19)` to increase the graphql timeout.
wandb: WARNING A graphql request initiated by the public wandb API timed out (timeout=9 sec). Create a new API with an integer timeout larger than 9, e.g., `api = wandb.Api(timeout=19)` to increase the graphql timeout.
wandb: WARNING A graphql request initiated by the public wandb API timed out (timeout=9 sec). Create a new API with an integer timeout larger than 9, e.g., `api = wandb.Api(timeout=19)` to increase the graphql timeout.
wandb: WARNING A graphql request initiated by the public wandb API timed out (timeout=9 sec). Create a new API with an integer timeout larger than 9, e.g., `api = wandb.Api(timeout=19)` to increase the graphql timeout.
test_roc_auc test_accuracy test_precision test_recall test_f1 fit_time score_time
HistGradientBoostingClassifier 0.915932 0.860000 0.750181 0.627740 0.683455 0.943964 0.059188
XGBClassifier 0.914350 0.859750 0.742949 0.638133 0.686398 2.057690 0.036659
LogisticRegression 0.898562 0.843250 0.714732 0.579968 0.640167 0.193686 0.027372
RandomForestClassifier 0.884420 0.835750 0.680117 0.600215 0.637618 6.557138 0.059342
KNeighborsClassifier 0.842528 0.815125 0.629923 0.562315 0.594060 0.058123 0.229496
GaussianNB 0.808289 0.583500 0.361538 0.931476 0.520086 0.055644 0.024651
DecisionTreeClassifier 0.743712 0.802375 0.587160 0.604880 0.595834 2.182848 0.024007
DummyClassifier 0.500000 0.759250 0.000000 0.000000 0.000000 0.058244 0.028908

The results table is nice, but we can also plot the metrics we’re interested in.

Since there’s imbalance in y, we look at the ROC AUC and F1 score as well as accuracy.

pnd.plot.metrics(metrics=["f1", "roc_auc", "accuracy"], kind="bar")
pnd.plot.roc_curve(estimator_names=["HistGradientBoostingClassifier", "XGBClassifier"])

Performance between HistGradientBoostingClassifier and XGBoost is very similar. We can go ahead and try to squeeze a bit more using hyperparameter tuning with PoniardBaseEstimator.tune_estimator.

Note that we’re building custom parameter grids, since the grids included in Poniard are smaller and better suited to full grid search, instead of the random search we will use.

name = "HistGradientBoostingClassifier"
grid = {
    f"{name}__learning_rate": np.arange(0.1, 1.1, 0.1),
    f"{name}__max_iter": np.arange(100, 520, 20),
    f"{name}__max_leaf_nodes": np.arange(10, 110, 10),
    f"{name}__l2_regularization": np.arange(0, 1.1, 0.1),
}
pnd.tune_estimator(name, grid=grid, mode="random", n_iter=100)
PoniardClassifier(n_jobs=-1, plugins=[WandBPlugin(project='adult-demo'), PandasProfilingPlugin()])

To keep it reasonable, we tune XGBoost for 50 rounds as it takes a lot longer than HGB.

name = "XGBClassifier"
grid = {
    f"{name}__n_estimators": np.arange(100, 520, 20),
    f"{name}__max_depth": np.arange(2, 21, 2),
    f"{name}__learning_rate": np.arange(0.1, 1.1, 0.1),
    f"{name}__min_child_weight": np.arange(1, 11, 1),
}
pnd.tune_estimator(name, grid=grid, mode="random", n_iter=50)
PoniardClassifier(n_jobs=-1, plugins=[WandBPlugin(project='adult-demo'), PandasProfilingPlugin()])

After tuning, the tuned estimators are added to PoniardBaseEstimator.pipelines, but they still need to be fit with CV. Calling PoniardBaseEstimator.fit will only run cross validation for new estimators only.

pnd.fit()
pnd.get_results()
test_roc_auc test_accuracy test_precision test_recall test_f1 fit_time score_time
HistGradientBoostingClassifier_tuned 0.917760 0.862625 0.757019 0.631896 0.688738 0.503392 0.034052
XGBClassifier_tuned 0.916398 0.858750 0.744218 0.629814 0.682057 1.103070 0.022097
HistGradientBoostingClassifier 0.915932 0.860000 0.750181 0.627740 0.683455 0.943964 0.059188
XGBClassifier 0.914350 0.859750 0.742949 0.638133 0.686398 2.057690 0.036659
LogisticRegression 0.898562 0.843250 0.714732 0.579968 0.640167 0.193686 0.027372
RandomForestClassifier 0.884420 0.835750 0.680117 0.600215 0.637618 6.557138 0.059342
KNeighborsClassifier 0.842528 0.815125 0.629923 0.562315 0.594060 0.058123 0.229496
GaussianNB 0.808289 0.583500 0.361538 0.931476 0.520086 0.055644 0.024651
DecisionTreeClassifier 0.743712 0.802375 0.587160 0.604880 0.595834 2.182848 0.024007
DummyClassifier 0.500000 0.759250 0.000000 0.000000 0.000000 0.058244 0.028908

From here we can get a summary of a given estimator’s performance with PoniardBaseEstimator.analyze_estimator.

pnd.analyze_estimator("HistGradientBoostingClassifier_tuned")

We can also understand where errors are coming from by leveraging ErrorAnalyzer.

analyzer = ErrorAnalyzer.from_poniard(
    pnd, estimator_names="HistGradientBoostingClassifier_tuned"
)
errors = analyzer.rank_errors()["HistGradientBoostingClassifier_tuned"]
errors["values"]
y prediction proba_0 proba_1 error
5953 1 0 0.999105 0.000895 0.999105
29128 1 0 0.997760 0.002240 0.997760
786 0 1 0.003175 0.996825 0.996825
11144 1 0 0.996473 0.003527 0.996473
2004 1 0 0.995659 0.004341 0.995659
... ... ... ... ... ...
37335 0 1 0.498782 0.501218 0.501218
17237 1 0 0.501033 0.498967 0.501033
29824 1 0 0.500940 0.499060 0.500940
18163 1 0 0.500752 0.499248 0.500752
35847 0 1 0.499532 0.500468 0.500468

1099 rows × 5 columns

analyzer.analyze_target(errors["idx"], as_ratio=True)
0_errors 0_target
class
1 0.645132 0.24075
0 0.354868 0.75925
analyzer.analyze_target(errors["idx"], wrt_target=True)
class
1    0.368120
0    0.064208
dtype: float64

As expected, the model’s errors are concentrated on the rare class.

We can also get a glimpse on how the features differ between errors and non-errors.

feature_analysis = analyzer.analyze_features(errors["idx"])
feature_analysis["marital-status"]
marital-status Divorced Married-AF-spouse Married-civ-spouse Married-spouse-absent Never-married Separated Widowed
error
0 0.145921 0.000435 0.397044 0.011013 0.373134 0.037096 0.035357
1 0.068244 0.000910 0.806187 0.003640 0.094631 0.010009 0.016379
feature_analysis["age"]
count mean std min 25% 50% 75% max
error
0 6901.0 37.824518 13.926955 17.0 26.0 36.0 47.0 90.0
1 1099.0 43.348499 11.097495 18.0 35.0 42.0 50.0 90.0
feature_analysis["capital-gain"]
count mean std min 25% 50% 75% max
error
0 6901.0 1265.274743 8104.97948 0.0 0.0 0.0 0.0 99999.0
1 1099.0 184.174704 2052.24528 0.0 0.0 0.0 0.0 41310.0

We might benefit from building an ensemble from multiple estimators. However, we’d like the estimators to complement each other, i.e, make different mistakes. This is when PoniardBaseEstimator.get_predictions_similarity can help.

pnd.get_predictions_similarity(on_errors=True)
LogisticRegression GaussianNB KNeighborsClassifier DecisionTreeClassifier RandomForestClassifier HistGradientBoostingClassifier XGBClassifier HistGradientBoostingClassifier_tuned XGBClassifier_tuned
LogisticRegression 1.000000 0.016189 0.578868 0.464133 0.600827 0.688545 0.655039 0.711309 0.690366
GaussianNB 0.016189 1.000000 0.050041 0.065285 0.047443 0.014467 0.017282 0.018000 0.024393
KNeighborsClassifier 0.578868 0.050041 1.000000 0.480397 0.643573 0.582190 0.562862 0.568010 0.569996
DecisionTreeClassifier 0.464133 0.065285 0.480397 1.000000 0.618275 0.529288 0.533057 0.505370 0.496691
RandomForestClassifier 0.600827 0.047443 0.643573 0.618275 1.000000 0.659699 0.642367 0.627170 0.635312
HistGradientBoostingClassifier 0.688545 0.014467 0.582190 0.529288 0.659699 1.000000 0.846959 0.877606 0.833977
XGBClassifier 0.655039 0.017282 0.562862 0.533057 0.642367 0.846959 1.000000 0.844242 0.813421
HistGradientBoostingClassifier_tuned 0.711309 0.018000 0.568010 0.505370 0.627170 0.877606 0.844242 1.000000 0.839548
XGBClassifier_tuned 0.690366 0.024393 0.569996 0.496691 0.635312 0.833977 0.813421 0.839548 1.000000

Analyzing the results so far and the similarity table, it looks like including LogisticRegression along with the gradient boosters could be a good idea, so we’ll go ahead and tune LR.

name = "LogisticRegression"
grid = {
    f"{name}__C": np.geomspace(0.1, 100, 200),
    f"{name}__class_weight": [None, "balanced"],
}
pnd.tune_estimator(name, grid=grid, mode="random", n_iter=100)
PoniardClassifier(n_jobs=-1, plugins=[WandBPlugin(project='adult-demo'), PandasProfilingPlugin()])

Next, we build a voting classifier with equal weights.

pnd.build_ensemble(
    estimator_names=[
        "HistGradientBoostingClassifier_tuned",
        "XGBClassifier_tuned",
        "LogisticRegression_tuned",
    ],
    method="voting",
    voting="soft",
)
pnd.fit()
pnd.get_results()
test_roc_auc test_accuracy test_precision test_recall test_f1 fit_time score_time
HistGradientBoostingClassifier_tuned 0.917760 0.862625 0.757019 0.631896 0.688738 0.503392 0.034052
VotingClassifier 0.917293 0.863500 0.765881 0.623587 0.687299 2.131655 0.044481
XGBClassifier_tuned 0.916398 0.858750 0.744218 0.629814 0.682057 1.103070 0.022097
HistGradientBoostingClassifier 0.915932 0.860000 0.750181 0.627740 0.683455 0.943964 0.059188
XGBClassifier 0.914350 0.859750 0.742949 0.638133 0.686398 2.057690 0.036659
LogisticRegression_tuned 0.898678 0.842875 0.717942 0.571660 0.636357 0.131910 0.026045
LogisticRegression 0.898562 0.843250 0.714732 0.579968 0.640167 0.193686 0.027372
RandomForestClassifier 0.884420 0.835750 0.680117 0.600215 0.637618 6.557138 0.059342
KNeighborsClassifier 0.842528 0.815125 0.629923 0.562315 0.594060 0.058123 0.229496
GaussianNB 0.808289 0.583500 0.361538 0.931476 0.520086 0.055644 0.024651
DecisionTreeClassifier 0.743712 0.802375 0.587160 0.604880 0.595834 2.182848 0.024007
DummyClassifier 0.500000 0.759250 0.000000 0.000000 0.000000 0.058244 0.028908

We could even tune the ensemble weights.

weights = []

for _ in range(200):
    unscaled = np.random.uniform(0, 1, size=3)
    scaled = unscaled / np.sum(unscaled)
    weights.append(scaled.tolist())

pnd.tune_estimator(
    "VotingClassifier",
    grid={"VotingClassifier__weights": weights},
    mode="random",
    n_iter=10,
)
PoniardClassifier(n_jobs=-1, plugins=[WandBPlugin(project='adult-demo'), PandasProfilingPlugin()])
pnd.fit()
pnd.get_results()
test_roc_auc test_accuracy test_precision test_recall test_f1 fit_time score_time
VotingClassifier_tuned 0.918799 0.862625 0.760029 0.627224 0.687134 2.107462 0.050291
HistGradientBoostingClassifier_tuned 0.917760 0.862625 0.757019 0.631896 0.688738 0.503392 0.034052
VotingClassifier 0.917293 0.863500 0.765881 0.623587 0.687299 2.131655 0.044481
XGBClassifier_tuned 0.916398 0.858750 0.744218 0.629814 0.682057 1.103070 0.022097
HistGradientBoostingClassifier 0.915932 0.860000 0.750181 0.627740 0.683455 0.943964 0.059188
XGBClassifier 0.914350 0.859750 0.742949 0.638133 0.686398 2.057690 0.036659
LogisticRegression_tuned 0.898678 0.842875 0.717942 0.571660 0.636357 0.131910 0.026045
LogisticRegression 0.898562 0.843250 0.714732 0.579968 0.640167 0.193686 0.027372
RandomForestClassifier 0.884420 0.835750 0.680117 0.600215 0.637618 6.557138 0.059342
KNeighborsClassifier 0.842528 0.815125 0.629923 0.562315 0.594060 0.058123 0.229496
GaussianNB 0.808289 0.583500 0.361538 0.931476 0.520086 0.055644 0.024651
DecisionTreeClassifier 0.743712 0.802375 0.587160 0.604880 0.595834 2.182848 0.024007
DummyClassifier 0.500000 0.759250 0.000000 0.000000 0.000000 0.058244 0.028908
pnd.plot.confusion_matrix("VotingClassifier_tuned", normalize="true")
pnd.plot.permutation_importance("VotingClassifier_tuned")
pnd.plot.partial_dependence("VotingClassifier_tuned", "marital-status")
final = pnd.get_estimator("VotingClassifier_tuned", retrain=True)
final
Pipeline(steps=[('preprocessor',
                 Pipeline(steps=[('type_preprocessor',
                                  ColumnTransformer(transformers=[('numeric_preprocessor',
                                                                   Pipeline(steps=[('numeric_imputer',
                                                                                    SimpleImputer()),
                                                                                   ('scaler',
                                                                                    StandardScaler())]),
                                                                   ['age',
                                                                    'capital-gain',
                                                                    'capital-loss',
                                                                    'hours-per-week']),
                                                                  ('categorical_low_preprocessor',
                                                                   Pipeline(steps=[('categorical_imputer',
                                                                                    SimpleImputer(st...
                                                             predictor='auto',
                                                             random_state=0,
                                                             reg_alpha=0,
                                                             reg_lambda=1,
                                                             scale_pos_weight=1,
                                                             subsample=1,
                                                             tree_method='exact',
                                                             use_label_encoder=False,
                                                             validate_parameters=1,
                                                             verbosity=0)),
                                              ('LogisticRegression_tuned',
                                               LogisticRegression(C=0.3255088599835058,
                                                                  max_iter=5000,
                                                                  random_state=0))],
                                  verbose=0, voting='soft',
                                  weights=[0.5963099656272408,
                                           0.2756042875483777,
                                           0.1280857468243813]))])
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
print(classification_report(y_test, final.predict(X_test), digits=3))
              precision    recall  f1-score   support

           0      0.896     0.946     0.920      1518
           1      0.793     0.654     0.717       482

    accuracy                          0.875      2000
   macro avg      0.845     0.800     0.818      2000
weighted avg      0.871     0.875     0.871      2000