A slightly more complex case study

Let’s try Poniard on the Adult Census dataset!

import numpy as np
import pandas as pd
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

from poniard import PoniardClassifier
from poniard.plugins import WandBPlugin, PandasProfilingPlugin
from poniard.error_analysis import ErrorAnalyzer

We’ll get the data and sample 10.000 observations to speed up training. Also, we’re going to cast pd.Categorical columns to object since scikit-learn doesn’t play will with them, drop the “fnlwgt” column as it refers to survey weights that should not provide any relevant information, and cast the target to 1-0.

# Adult Census dataset
X, y = fetch_openml(data_id=1590, return_X_y=True, as_frame=True)

X = X.sample(n=10000, random_state=0).drop("fnlwgt", axis=1)
y = y.reindex(X.index)

category_cols = X.select_dtypes(include="category").columns
X = X.astype({col: object for col in category_cols})

y = y.replace({">50K": 1, "<=50K": 0})

Next we split and pass only the training data to Poniard.

We’ll be using the 2 available plugins.

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y)

pnd = PoniardClassifier(
    n_jobs=-1,
    plugins=[WandBPlugin(project="adult-demo"), PandasProfilingPlugin()],
)
pnd.setup(X_train, y_train)
pnd.remove_estimators("SVC")  # Doesn't scale nicely

wandb: Currently logged in as: rxavier. Use `wandb login --relogin` to force relogin

wandb version 0.13.5 is available! To upgrade, please run: $ pip install wandb --upgrade

Tracking run with wandb version 0.13.3

Run data is saved locally in /Users/rafxavier/Documents/Repos/personal/poniard/nbs/guide/wandb/run-20221203_231425-1n5yya80

Syncing run wandering-butterfly-15 to Weights & Biases (docs)

Setup info

Target

Type: binary

Shape: (8000,)

Unique values: 2

Metrics

Main metric: roc_auc

Feature type inference

Minimum unique values to consider a number-like feature numeric: 800

Minimum unique values to consider a categorical feature high cardinality: 20

Inferred feature types:

	categorical_high	categorical_low
0	age	education-num
1	capital-gain	workclass
2	capital-loss	education
3	hours-per-week	marital-status
4	native-country	occupation
5		relationship
6		race
7		sex

PoniardClassifier(n_jobs=-1, plugins=[WandBPlugin(project='adult-demo'), PandasProfilingPlugin()])

As can be seen, Pandas Profiling already created a report and saved it to the default location. If ipywidgets is installed, the report will be included in the output.

Meanwhile, Weights and Biases either logged in or prompted for a login, and started logging information about the run (preprocessor HTML representation, dataset, inferred types). Also, because plugins can check whether other plugins are included in a Poniard estimator (by using BasePlugin._check_plugin_used), wandb also uploaded the profile report.

Right away there’s some misclassified features, so we’ll reassign them.

pnd.reassign_types(numeric=["age", "capital-gain", "capital-loss", "hours-per-week"])

Assigned feature types:

	numeric	categorical_high	categorical_low
0	age	native-country	education-num
1	capital-gain		workclass
2	capital-loss		education
3	hours-per-week		marital-status
4			occupation
5			relationship
6			race
7			sex

PoniardClassifier(n_jobs=-1, plugins=[WandBPlugin(project='adult-demo'), PandasProfilingPlugin()])

pnd.fit()
pnd.get_results()

wandb: WARNING A graphql request initiated by the public wandb API timed out (timeout=9 sec). Create a new API with an integer timeout larger than 9, e.g., `api = wandb.Api(timeout=19)` to increase the graphql timeout.
wandb: WARNING A graphql request initiated by the public wandb API timed out (timeout=9 sec). Create a new API with an integer timeout larger than 9, e.g., `api = wandb.Api(timeout=19)` to increase the graphql timeout.
wandb: WARNING A graphql request initiated by the public wandb API timed out (timeout=9 sec). Create a new API with an integer timeout larger than 9, e.g., `api = wandb.Api(timeout=19)` to increase the graphql timeout.
wandb: WARNING A graphql request initiated by the public wandb API timed out (timeout=9 sec). Create a new API with an integer timeout larger than 9, e.g., `api = wandb.Api(timeout=19)` to increase the graphql timeout.
wandb: WARNING A graphql request initiated by the public wandb API timed out (timeout=9 sec). Create a new API with an integer timeout larger than 9, e.g., `api = wandb.Api(timeout=19)` to increase the graphql timeout.
wandb: WARNING A graphql request initiated by the public wandb API timed out (timeout=9 sec). Create a new API with an integer timeout larger than 9, e.g., `api = wandb.Api(timeout=19)` to increase the graphql timeout.
wandb: WARNING A graphql request initiated by the public wandb API timed out (timeout=9 sec). Create a new API with an integer timeout larger than 9, e.g., `api = wandb.Api(timeout=19)` to increase the graphql timeout.
wandb: WARNING A graphql request initiated by the public wandb API timed out (timeout=9 sec). Create a new API with an integer timeout larger than 9, e.g., `api = wandb.Api(timeout=19)` to increase the graphql timeout.
wandb: WARNING A graphql request initiated by the public wandb API timed out (timeout=9 sec). Create a new API with an integer timeout larger than 9, e.g., `api = wandb.Api(timeout=19)` to increase the graphql timeout.
wandb: WARNING A graphql request initiated by the public wandb API timed out (timeout=9 sec). Create a new API with an integer timeout larger than 9, e.g., `api = wandb.Api(timeout=19)` to increase the graphql timeout.
wandb: WARNING A graphql request initiated by the public wandb API timed out (timeout=9 sec). Create a new API with an integer timeout larger than 9, e.g., `api = wandb.Api(timeout=19)` to increase the graphql timeout.
wandb: Network error (ReadTimeout), entering retry loop.
wandb: WARNING A graphql request initiated by the public wandb API timed out (timeout=9 sec). Create a new API with an integer timeout larger than 9, e.g., `api = wandb.Api(timeout=19)` to increase the graphql timeout.
wandb: WARNING A graphql request initiated by the public wandb API timed out (timeout=9 sec). Create a new API with an integer timeout larger than 9, e.g., `api = wandb.Api(timeout=19)` to increase the graphql timeout.
wandb: WARNING A graphql request initiated by the public wandb API timed out (timeout=9 sec). Create a new API with an integer timeout larger than 9, e.g., `api = wandb.Api(timeout=19)` to increase the graphql timeout.
wandb: WARNING A graphql request initiated by the public wandb API timed out (timeout=9 sec). Create a new API with an integer timeout larger than 9, e.g., `api = wandb.Api(timeout=19)` to increase the graphql timeout.

	test_roc_auc	test_accuracy	test_precision	test_recall	test_f1	fit_time	score_time
HistGradientBoostingClassifier	0.915932	0.860000	0.750181	0.627740	0.683455	0.943964	0.059188
XGBClassifier	0.914350	0.859750	0.742949	0.638133	0.686398	2.057690	0.036659
LogisticRegression	0.898562	0.843250	0.714732	0.579968	0.640167	0.193686	0.027372
RandomForestClassifier	0.884420	0.835750	0.680117	0.600215	0.637618	6.557138	0.059342
KNeighborsClassifier	0.842528	0.815125	0.629923	0.562315	0.594060	0.058123	0.229496
GaussianNB	0.808289	0.583500	0.361538	0.931476	0.520086	0.055644	0.024651
DecisionTreeClassifier	0.743712	0.802375	0.587160	0.604880	0.595834	2.182848	0.024007
DummyClassifier	0.500000	0.759250	0.000000	0.000000	0.000000	0.058244	0.028908

The results table is nice, but we can also plot the metrics we’re interested in.

Since there’s imbalance in y, we look at the ROC AUC and F1 score as well as accuracy.

pnd.plot.metrics(metrics=["f1", "roc_auc", "accuracy"], kind="bar")

pnd.plot.roc_curve(estimator_names=["HistGradientBoostingClassifier", "XGBClassifier"])

Performance between HistGradientBoostingClassifier and XGBoost is very similar. We can go ahead and try to squeeze a bit more using hyperparameter tuning with PoniardBaseEstimator.tune_estimator.

Note that we’re building custom parameter grids, since the grids included in Poniard are smaller and better suited to full grid search, instead of the random search we will use.

name = "HistGradientBoostingClassifier"
grid = {
    f"{name}__learning_rate": np.arange(0.1, 1.1, 0.1),
    f"{name}__max_iter": np.arange(100, 520, 20),
    f"{name}__max_leaf_nodes": np.arange(10, 110, 10),
    f"{name}__l2_regularization": np.arange(0, 1.1, 0.1),
}
pnd.tune_estimator(name, grid=grid, mode="random", n_iter=100)

PoniardClassifier(n_jobs=-1, plugins=[WandBPlugin(project='adult-demo'), PandasProfilingPlugin()])

To keep it reasonable, we tune XGBoost for 50 rounds as it takes a lot longer than HGB.

name = "XGBClassifier"
grid = {
    f"{name}__n_estimators": np.arange(100, 520, 20),
    f"{name}__max_depth": np.arange(2, 21, 2),
    f"{name}__learning_rate": np.arange(0.1, 1.1, 0.1),
    f"{name}__min_child_weight": np.arange(1, 11, 1),
}
pnd.tune_estimator(name, grid=grid, mode="random", n_iter=50)

PoniardClassifier(n_jobs=-1, plugins=[WandBPlugin(project='adult-demo'), PandasProfilingPlugin()])

After tuning, the tuned estimators are added to PoniardBaseEstimator.pipelines, but they still need to be fit with CV. Calling PoniardBaseEstimator.fit will only run cross validation for new estimators only.

pnd.fit()
pnd.get_results()

	test_roc_auc	test_accuracy	test_precision	test_recall	test_f1	fit_time	score_time
HistGradientBoostingClassifier_tuned	0.917760	0.862625	0.757019	0.631896	0.688738	0.503392	0.034052
XGBClassifier_tuned	0.916398	0.858750	0.744218	0.629814	0.682057	1.103070	0.022097
HistGradientBoostingClassifier	0.915932	0.860000	0.750181	0.627740	0.683455	0.943964	0.059188
XGBClassifier	0.914350	0.859750	0.742949	0.638133	0.686398	2.057690	0.036659
LogisticRegression	0.898562	0.843250	0.714732	0.579968	0.640167	0.193686	0.027372
RandomForestClassifier	0.884420	0.835750	0.680117	0.600215	0.637618	6.557138	0.059342
KNeighborsClassifier	0.842528	0.815125	0.629923	0.562315	0.594060	0.058123	0.229496
GaussianNB	0.808289	0.583500	0.361538	0.931476	0.520086	0.055644	0.024651
DecisionTreeClassifier	0.743712	0.802375	0.587160	0.604880	0.595834	2.182848	0.024007
DummyClassifier	0.500000	0.759250	0.000000	0.000000	0.000000	0.058244	0.028908

From here we can get a summary of a given estimator’s performance with PoniardBaseEstimator.analyze_estimator.

pnd.analyze_estimator("HistGradientBoostingClassifier_tuned")

We can also understand where errors are coming from by leveraging ErrorAnalyzer.

analyzer = ErrorAnalyzer.from_poniard(
    pnd, estimator_names="HistGradientBoostingClassifier_tuned"
)
errors = analyzer.rank_errors()["HistGradientBoostingClassifier_tuned"]
errors["values"]

	y	prediction	proba_0	proba_1	error
5953	1	0	0.999105	0.000895	0.999105
29128	1	0	0.997760	0.002240	0.997760
786	0	1	0.003175	0.996825	0.996825
11144	1	0	0.996473	0.003527	0.996473
2004	1	0	0.995659	0.004341	0.995659
...	...	...	...	...	...
37335	0	1	0.498782	0.501218	0.501218
17237	1	0	0.501033	0.498967	0.501033
29824	1	0	0.500940	0.499060	0.500940
18163	1	0	0.500752	0.499248	0.500752
35847	0	1	0.499532	0.500468	0.500468

1099 rows × 5 columns

analyzer.analyze_target(errors["idx"], as_ratio=True)

	0_errors	0_target
class
1	0.645132	0.24075
0	0.354868	0.75925

analyzer.analyze_target(errors["idx"], wrt_target=True)

class
1    0.368120
0    0.064208
dtype: float64

As expected, the model’s errors are concentrated on the rare class.

We can also get a glimpse on how the features differ between errors and non-errors.

feature_analysis = analyzer.analyze_features(errors["idx"])
feature_analysis["marital-status"]

marital-status	Divorced	Married-AF-spouse	Married-civ-spouse	Married-spouse-absent	Never-married	Separated	Widowed
error
0	0.145921	0.000435	0.397044	0.011013	0.373134	0.037096	0.035357
1	0.068244	0.000910	0.806187	0.003640	0.094631	0.010009	0.016379

feature_analysis["age"]

	count	mean	std	min	25%	50%	75%	max
error
0	6901.0	37.824518	13.926955	17.0	26.0	36.0	47.0	90.0
1	1099.0	43.348499	11.097495	18.0	35.0	42.0	50.0	90.0

feature_analysis["capital-gain"]

	count	mean	std	min	25%	50%	75%	max
error
0	6901.0	1265.274743	8104.97948	0.0	0.0	0.0	0.0	99999.0
1	1099.0	184.174704	2052.24528	0.0	0.0	0.0	0.0	41310.0

We might benefit from building an ensemble from multiple estimators. However, we’d like the estimators to complement each other, i.e, make different mistakes. This is when PoniardBaseEstimator.get_predictions_similarity can help.

pnd.get_predictions_similarity(on_errors=True)

	LogisticRegression	GaussianNB	KNeighborsClassifier	DecisionTreeClassifier	RandomForestClassifier	HistGradientBoostingClassifier	XGBClassifier	HistGradientBoostingClassifier_tuned	XGBClassifier_tuned
LogisticRegression	1.000000	0.016189	0.578868	0.464133	0.600827	0.688545	0.655039	0.711309	0.690366
GaussianNB	0.016189	1.000000	0.050041	0.065285	0.047443	0.014467	0.017282	0.018000	0.024393
KNeighborsClassifier	0.578868	0.050041	1.000000	0.480397	0.643573	0.582190	0.562862	0.568010	0.569996
DecisionTreeClassifier	0.464133	0.065285	0.480397	1.000000	0.618275	0.529288	0.533057	0.505370	0.496691
RandomForestClassifier	0.600827	0.047443	0.643573	0.618275	1.000000	0.659699	0.642367	0.627170	0.635312
HistGradientBoostingClassifier	0.688545	0.014467	0.582190	0.529288	0.659699	1.000000	0.846959	0.877606	0.833977
XGBClassifier	0.655039	0.017282	0.562862	0.533057	0.642367	0.846959	1.000000	0.844242	0.813421
HistGradientBoostingClassifier_tuned	0.711309	0.018000	0.568010	0.505370	0.627170	0.877606	0.844242	1.000000	0.839548
XGBClassifier_tuned	0.690366	0.024393	0.569996	0.496691	0.635312	0.833977	0.813421	0.839548	1.000000

Analyzing the results so far and the similarity table, it looks like including LogisticRegression along with the gradient boosters could be a good idea, so we’ll go ahead and tune LR.

name = "LogisticRegression"
grid = {
    f"{name}__C": np.geomspace(0.1, 100, 200),
    f"{name}__class_weight": [None, "balanced"],
}
pnd.tune_estimator(name, grid=grid, mode="random", n_iter=100)

PoniardClassifier(n_jobs=-1, plugins=[WandBPlugin(project='adult-demo'), PandasProfilingPlugin()])

Next, we build a voting classifier with equal weights.

pnd.build_ensemble(
    estimator_names=[
        "HistGradientBoostingClassifier_tuned",
        "XGBClassifier_tuned",
        "LogisticRegression_tuned",
    ],
    method="voting",
    voting="soft",
)
pnd.fit()
pnd.get_results()

	test_roc_auc	test_accuracy	test_precision	test_recall	test_f1	fit_time	score_time
HistGradientBoostingClassifier_tuned	0.917760	0.862625	0.757019	0.631896	0.688738	0.503392	0.034052
VotingClassifier	0.917293	0.863500	0.765881	0.623587	0.687299	2.131655	0.044481
XGBClassifier_tuned	0.916398	0.858750	0.744218	0.629814	0.682057	1.103070	0.022097
HistGradientBoostingClassifier	0.915932	0.860000	0.750181	0.627740	0.683455	0.943964	0.059188
XGBClassifier	0.914350	0.859750	0.742949	0.638133	0.686398	2.057690	0.036659
LogisticRegression_tuned	0.898678	0.842875	0.717942	0.571660	0.636357	0.131910	0.026045
LogisticRegression	0.898562	0.843250	0.714732	0.579968	0.640167	0.193686	0.027372
RandomForestClassifier	0.884420	0.835750	0.680117	0.600215	0.637618	6.557138	0.059342
KNeighborsClassifier	0.842528	0.815125	0.629923	0.562315	0.594060	0.058123	0.229496
GaussianNB	0.808289	0.583500	0.361538	0.931476	0.520086	0.055644	0.024651
DecisionTreeClassifier	0.743712	0.802375	0.587160	0.604880	0.595834	2.182848	0.024007
DummyClassifier	0.500000	0.759250	0.000000	0.000000	0.000000	0.058244	0.028908

We could even tune the ensemble weights.

weights = []

for _ in range(200):
    unscaled = np.random.uniform(0, 1, size=3)
    scaled = unscaled / np.sum(unscaled)
    weights.append(scaled.tolist())

pnd.tune_estimator(
    "VotingClassifier",
    grid={"VotingClassifier__weights": weights},
    mode="random",
    n_iter=10,
)

PoniardClassifier(n_jobs=-1, plugins=[WandBPlugin(project='adult-demo'), PandasProfilingPlugin()])

pnd.fit()
pnd.get_results()

	test_roc_auc	test_accuracy	test_precision	test_recall	test_f1	fit_time	score_time
VotingClassifier_tuned	0.918799	0.862625	0.760029	0.627224	0.687134	2.107462	0.050291
HistGradientBoostingClassifier_tuned	0.917760	0.862625	0.757019	0.631896	0.688738	0.503392	0.034052
VotingClassifier	0.917293	0.863500	0.765881	0.623587	0.687299	2.131655	0.044481
XGBClassifier_tuned	0.916398	0.858750	0.744218	0.629814	0.682057	1.103070	0.022097
HistGradientBoostingClassifier	0.915932	0.860000	0.750181	0.627740	0.683455	0.943964	0.059188
XGBClassifier	0.914350	0.859750	0.742949	0.638133	0.686398	2.057690	0.036659
LogisticRegression_tuned	0.898678	0.842875	0.717942	0.571660	0.636357	0.131910	0.026045
LogisticRegression	0.898562	0.843250	0.714732	0.579968	0.640167	0.193686	0.027372
RandomForestClassifier	0.884420	0.835750	0.680117	0.600215	0.637618	6.557138	0.059342
KNeighborsClassifier	0.842528	0.815125	0.629923	0.562315	0.594060	0.058123	0.229496
GaussianNB	0.808289	0.583500	0.361538	0.931476	0.520086	0.055644	0.024651
DecisionTreeClassifier	0.743712	0.802375	0.587160	0.604880	0.595834	2.182848	0.024007
DummyClassifier	0.500000	0.759250	0.000000	0.000000	0.000000	0.058244	0.028908

pnd.plot.confusion_matrix("VotingClassifier_tuned", normalize="true")

pnd.plot.permutation_importance("VotingClassifier_tuned")

pnd.plot.partial_dependence("VotingClassifier_tuned", "marital-status")

final = pnd.get_estimator("VotingClassifier_tuned", retrain=True)
final

print(classification_report(y_test, final.predict(X_test), digits=3))

              precision    recall  f1-score   support

           0      0.896     0.946     0.920      1518
           1      0.793     0.654     0.717       482

    accuracy                          0.875      2000
   macro avg      0.845     0.800     0.818      2000
weighted avg      0.871     0.875     0.871      2000