import numpy as np
import pandas as pd
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from poniard import PoniardClassifier
from poniard.plugins import WandBPlugin, PandasProfilingPlugin
from poniard.error_analysis import ErrorAnalyzer
A slightly more complex case study
We’ll get the data and sample 10,000 observations to speed up training. We’ll also cast pd.Categorical columns to object, since scikit-learn doesn’t play well with them, drop the “fnlwgt” column, as it holds survey weights that should not provide any relevant information, and map the target to 1/0.
# Adult Census dataset
X, y = fetch_openml(data_id=1590, return_X_y=True, as_frame=True)
# Sample 10,000 rows, drop the survey weights and keep y aligned with X
X = X.sample(n=10000, random_state=0).drop("fnlwgt", axis=1)
y = y.reindex(X.index)
# Cast categorical columns to object
category_cols = X.select_dtypes(include="category").columns
X = X.astype({col: object for col in category_cols})
# Encode the target as 1/0
y = y.replace({">50K": 1, "<=50K": 0})
Next we split and pass only the training data to Poniard.
We’ll be using the 2 available plugins.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y)
pnd = PoniardClassifier(
    n_jobs=-1,
    plugins=[WandBPlugin(project="adult-demo"), PandasProfilingPlugin()],
)
pnd.setup(X_train, y_train)
pnd.remove_estimators("SVC")  # Doesn't scale nicely
wandb: Currently logged in as: rxavier. Use `wandb login --relogin` to force relogin
/Users/rafxavier/Documents/Repos/personal/poniard/nbs/guide/wandb/run-20221203_231425-1n5yya80
Setup info
Target
Type: binary
Shape: (8000,)
Unique values: 2
Metrics
Main metric: roc_auc
Feature type inference
Minimum unique values to consider a number-like feature numeric: 800
Minimum unique values to consider a categorical feature high cardinality: 20
Inferred feature types:
row | numeric | categorical_high | categorical_low | datetime |
---|---|---|---|---|
0 | | age | education-num | |
1 | | capital-gain | workclass | |
2 | | capital-loss | education | |
3 | | hours-per-week | marital-status | |
4 | | native-country | occupation | |
5 | | | relationship | |
6 | | | race | |
7 | | | sex | |
PoniardClassifier(n_jobs=-1, plugins=[WandBPlugin(project='adult-demo'), PandasProfilingPlugin()])
As can be seen, Pandas Profiling already created a report and saved it to the default location. If ipywidgets is installed, the report will be included in the output.
Meanwhile, Weights and Biases either logged in or prompted for a login, and started logging information about the run (preprocessor HTML representation, dataset, inferred types). Also, because plugins can check whether other plugins are included in a Poniard estimator (by using BasePlugin._check_plugin_used), wandb also uploaded the profile report.
Right away we can see that some features are misclassified, so we’ll reassign them.
pnd.reassign_types(numeric=["age", "capital-gain", "capital-loss", "hours-per-week"])
Assigned feature types:
row | numeric | categorical_high | categorical_low | datetime |
---|---|---|---|---|
0 | age | native-country | education-num | |
1 | capital-gain | | workclass | |
2 | capital-loss | | education | |
3 | hours-per-week | | marital-status | |
4 | | | occupation | |
5 | | | relationship | |
6 | | | race | |
7 | | | sex | |
PoniardClassifier(n_jobs=-1, plugins=[WandBPlugin(project='adult-demo'), PandasProfilingPlugin()])
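The two thresholds printed in the setup info help explain why the number-like columns were not inferred as numeric at first. The following is a minimal sketch of a unique-count rule of that kind, written in plain Python. It is not Poniard’s implementation: the function name, the exact comparisons and the toy columns are assumptions made for illustration.

```python
def infer_feature_type(values, numeric_threshold=800, cardinality_threshold=20):
    """Classify one column with a unique-count rule (illustrative only)."""
    n_unique = len(set(values))
    # Treat a column as number-like when every value is an int or float.
    number_like = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in values
    )
    if number_like and n_unique >= numeric_threshold:
        return "numeric"
    if n_unique >= cardinality_threshold:
        return "categorical_high"
    return "categorical_low"


# Hypothetical columns with 8,000 rows each.
age_like = [17 + i % 74 for i in range(8000)]         # few distinct integers
education_like = [1 + i % 16 for i in range(8000)]    # very few distinct integers
continuous_like = [i * 0.5 for i in range(8000)]      # all values distinct

age_type = infer_feature_type(age_like)
education_type = infer_feature_type(education_like)
continuous_type = infer_feature_type(continuous_like)
print(age_type, education_type, continuous_type)
```

Under a rule like this, an integer column with fewer distinct values than the numeric threshold falls through to one of the categorical buckets, which is why an explicit reassign_types call is needed for columns such as age.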
pnd.fit()
pnd.get_results()
wandb: WARNING A graphql request initiated by the public wandb API timed out (timeout=9 sec). Create a new API with an integer timeout larger than 9, e.g., `api = wandb.Api(timeout=19)` to increase the graphql timeout.
wandb: Network error (ReadTimeout), entering retry loop.
estimator | test_roc_auc | test_accuracy | test_precision | test_recall | test_f1 | fit_time | score_time |
---|---|---|---|---|---|---|---|
HistGradientBoostingClassifier | 0.915932 | 0.860000 | 0.750181 | 0.627740 | 0.683455 | 0.943964 | 0.059188 |
XGBClassifier | 0.914350 | 0.859750 | 0.742949 | 0.638133 | 0.686398 | 2.057690 | 0.036659 |
LogisticRegression | 0.898562 | 0.843250 | 0.714732 | 0.579968 | 0.640167 | 0.193686 | 0.027372 |
RandomForestClassifier | 0.884420 | 0.835750 | 0.680117 | 0.600215 | 0.637618 | 6.557138 | 0.059342 |
KNeighborsClassifier | 0.842528 | 0.815125 | 0.629923 | 0.562315 | 0.594060 | 0.058123 | 0.229496 |
GaussianNB | 0.808289 | 0.583500 | 0.361538 | 0.931476 | 0.520086 | 0.055644 | 0.024651 |
DecisionTreeClassifier | 0.743712 | 0.802375 | 0.587160 | 0.604880 | 0.595834 | 2.182848 | 0.024007 |
DummyClassifier | 0.500000 | 0.759250 | 0.000000 | 0.000000 | 0.000000 | 0.058244 | 0.028908 |
The results table is nice, but we can also plot the metrics we’re interested in.
Since there’s imbalance in y, we look at the ROC AUC and F1 score as well as accuracy.
pnd.plot.metrics(metrics=["f1", "roc_auc", "accuracy"], kind="bar")
pnd.plot.roc_curve(estimator_names=["HistGradientBoostingClassifier", "XGBClassifier"])
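To see why accuracy alone is misleading here, the sketch below scores a classifier that always predicts the majority class. The class counts (6,074 zeros and 1,926 ones) are derived from the 8,000 training rows and the 0.75925 share of class 0 reported elsewhere in this guide; the helper function is hypothetical and written in plain Python rather than with scikit-learn.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for the positive class (label 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    # Define the ratios as 0 when their denominator is 0.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


y_true = [0] * 6074 + [1] * 1926   # derived class counts, see lead-in
y_pred = [0] * len(y_true)         # always predict the majority class
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

The result mirrors the DummyClassifier row in the results table: about 76% accuracy with an F1 of 0, which is why F1 and ROC AUC are the more informative metrics for this target.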
HistGradientBoostingClassifier and XGBClassifier perform very similarly. We can try to squeeze out a bit more with hyperparameter tuning through PoniardBaseEstimator.tune_estimator.
Note that we’re building custom parameter grids, since the grids included in Poniard are smaller and better suited to full grid search, instead of the random search we will use.
name = "HistGradientBoostingClassifier"
# Parameters are prefixed with the estimator name, as they are set on the pipeline step
grid = {
    f"{name}__learning_rate": np.arange(0.1, 1.1, 0.1),
    f"{name}__max_iter": np.arange(100, 520, 20),
    f"{name}__max_leaf_nodes": np.arange(10, 110, 10),
    f"{name}__l2_regularization": np.arange(0, 1.1, 0.1),
}
pnd.tune_estimator(name, grid=grid, mode="random", n_iter=100)
PoniardClassifier(n_jobs=-1, plugins=[WandBPlugin(project='adult-demo'), PandasProfilingPlugin()])
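The reason random search is preferred for a grid of this size becomes clear once the combinations are counted. The sketch below rebuilds the same value ranges with the standard library (without the estimator name prefix) and draws 100 random candidates. It only illustrates the idea of sampling from a grid; it is not how Poniard or scikit-learn perform the search, and this naive sampler may draw the same combination twice.

```python
import math
import random

# Same value ranges as the HistGradientBoostingClassifier grid above.
grid = {
    "learning_rate": [round(0.1 * i, 1) for i in range(1, 11)],
    "max_iter": list(range(100, 520, 20)),
    "max_leaf_nodes": list(range(10, 110, 10)),
    "l2_regularization": [round(0.1 * i, 1) for i in range(0, 11)],
}

# A full grid search would evaluate every combination.
n_combinations = math.prod(len(values) for values in grid.values())

# Random search evaluates only a fixed number of sampled combinations.
rng = random.Random(0)
candidates = [
    {param: rng.choice(values) for param, values in grid.items()}
    for _ in range(100)
]

print(n_combinations, len(candidates))  # 23100 100
```

With 23,100 possible combinations, 100 iterations cover well under 1% of the grid, which keeps the tuning time manageable.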
To keep the run time reasonable, we tune XGBoost for only 50 iterations, as it takes a lot longer than HGB.
name = "XGBClassifier"
grid = {
    f"{name}__n_estimators": np.arange(100, 520, 20),
    f"{name}__max_depth": np.arange(2, 21, 2),
    f"{name}__learning_rate": np.arange(0.1, 1.1, 0.1),
    f"{name}__min_child_weight": np.arange(1, 11, 1),
}
pnd.tune_estimator(name, grid=grid, mode="random", n_iter=50)
PoniardClassifier(n_jobs=-1, plugins=[WandBPlugin(project='adult-demo'), PandasProfilingPlugin()])
After tuning, the tuned estimators are added to PoniardBaseEstimator.pipelines, but they still need to be fit with CV. Calling PoniardBaseEstimator.fit will run cross validation only for the new estimators.
pnd.fit()
pnd.get_results()
estimator | test_roc_auc | test_accuracy | test_precision | test_recall | test_f1 | fit_time | score_time |
---|---|---|---|---|---|---|---|
HistGradientBoostingClassifier_tuned | 0.917760 | 0.862625 | 0.757019 | 0.631896 | 0.688738 | 0.503392 | 0.034052 |
XGBClassifier_tuned | 0.916398 | 0.858750 | 0.744218 | 0.629814 | 0.682057 | 1.103070 | 0.022097 |
HistGradientBoostingClassifier | 0.915932 | 0.860000 | 0.750181 | 0.627740 | 0.683455 | 0.943964 | 0.059188 |
XGBClassifier | 0.914350 | 0.859750 | 0.742949 | 0.638133 | 0.686398 | 2.057690 | 0.036659 |
LogisticRegression | 0.898562 | 0.843250 | 0.714732 | 0.579968 | 0.640167 | 0.193686 | 0.027372 |
RandomForestClassifier | 0.884420 | 0.835750 | 0.680117 | 0.600215 | 0.637618 | 6.557138 | 0.059342 |
KNeighborsClassifier | 0.842528 | 0.815125 | 0.629923 | 0.562315 | 0.594060 | 0.058123 | 0.229496 |
GaussianNB | 0.808289 | 0.583500 | 0.361538 | 0.931476 | 0.520086 | 0.055644 | 0.024651 |
DecisionTreeClassifier | 0.743712 | 0.802375 | 0.587160 | 0.604880 | 0.595834 | 2.182848 | 0.024007 |
DummyClassifier | 0.500000 | 0.759250 | 0.000000 | 0.000000 | 0.000000 | 0.058244 | 0.028908 |
From here we can get a summary of a given estimator’s performance with PoniardBaseEstimator.analyze_estimator.
pnd.analyze_estimator("HistGradientBoostingClassifier_tuned")
We can also understand where errors are coming from by leveraging ErrorAnalyzer.
analyzer = ErrorAnalyzer.from_poniard(
    pnd, estimator_names="HistGradientBoostingClassifier_tuned"
)
errors = analyzer.rank_errors()["HistGradientBoostingClassifier_tuned"]
errors["values"]
index | y | prediction | proba_0 | proba_1 | error |
---|---|---|---|---|---|
5953 | 1 | 0 | 0.999105 | 0.000895 | 0.999105 |
29128 | 1 | 0 | 0.997760 | 0.002240 | 0.997760 |
786 | 0 | 1 | 0.003175 | 0.996825 | 0.996825 |
11144 | 1 | 0 | 0.996473 | 0.003527 | 0.996473 |
2004 | 1 | 0 | 0.995659 | 0.004341 | 0.995659 |
... | ... | ... | ... | ... | ... |
37335 | 0 | 1 | 0.498782 | 0.501218 | 0.501218 |
17237 | 1 | 0 | 0.501033 | 0.498967 | 0.501033 |
29824 | 1 | 0 | 0.500940 | 0.499060 | 0.500940 |
18163 | 1 | 0 | 0.500752 | 0.499248 | 0.500752 |
35847 | 0 | 1 | 0.499532 | 0.500468 | 0.500468 |
1099 rows × 5 columns
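In the table above, the error column equals the probability the model assigned to the class it wrongly predicted, so the most confident mistakes come first. The sketch below reproduces that ranking logic in plain Python under this assumption. The function and the toy data are hypothetical and are not part of ErrorAnalyzer.

```python
def rank_confident_errors(y_true, proba_1, threshold=0.5):
    """Return misclassified rows sorted by the probability given to the wrong class."""
    rows = []
    for idx, (y, p1) in enumerate(zip(y_true, proba_1)):
        prediction = int(p1 >= threshold)
        if prediction != y:
            # Probability assigned to the (wrong) predicted class.
            error = p1 if prediction == 1 else 1 - p1
            rows.append({"idx": idx, "y": y, "prediction": prediction, "error": error})
    return sorted(rows, key=lambda row: row["error"], reverse=True)


# Toy labels and predicted probabilities for class 1.
y_true = [1, 0, 1, 0, 1]
proba_1 = [0.001, 0.997, 0.8, 0.2, 0.499]

ranked = rank_confident_errors(y_true, proba_1)
for row in ranked:
    print(row)
```

Rows 2 and 3 are classified correctly and are left out. Row 4 is a borderline mistake and lands at the bottom, just like the rows with an error close to 0.5 in the table.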
analyzer.analyze_target(errors["idx"], as_ratio=True)
class | 0_errors | 0_target |
---|---|---|
1 | 0.645132 | 0.24075 |
0 | 0.354868 | 0.75925 |
analyzer.analyze_target(errors["idx"], wrt_target=True)
class
1 0.368120
0 0.064208
dtype: float64
As expected, the model’s errors are concentrated on the rare class.
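The two outputs above answer different questions, and both can be recovered from raw counts. The counts used below (709 and 390 errors, 1,926 and 6,074 observations per class) are not printed by Poniard. They are back-calculated from the 1,099 errors, the 8,000 training rows and the ratios shown above, so treat them as a worked check rather than library output.

```python
n_rows = 8000
n_errors = 1099

# Back-calculated counts per class (see lead-in).
errors_by_class = {1: 709, 0: 390}
targets_by_class = {1: 1926, 0: 6074}

# as_ratio=True view: share of all errors (and of all rows) that falls in each class.
share_of_errors = {c: errors_by_class[c] / n_errors for c in errors_by_class}
share_of_targets = {c: targets_by_class[c] / n_rows for c in targets_by_class}

# wrt_target=True view: error rate within each class.
error_rate = {c: errors_by_class[c] / targets_by_class[c] for c in errors_by_class}

for c in (1, 0):
    print(c, round(share_of_errors[c], 6), round(share_of_targets[c], 6), round(error_rate[c], 6))
```

Class 1 makes up about 24% of the rows but about 65% of the errors, and roughly 37% of class 1 rows are misclassified versus about 6% of class 0 rows.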
We can also get a glimpse of how the features differ between errors and non-errors.
feature_analysis = analyzer.analyze_features(errors["idx"])
feature_analysis["marital-status"]
error | Divorced | Married-AF-spouse | Married-civ-spouse | Married-spouse-absent | Never-married | Separated | Widowed |
---|---|---|---|---|---|---|---|
0 | 0.145921 | 0.000435 | 0.397044 | 0.011013 | 0.373134 | 0.037096 | 0.035357 |
1 | 0.068244 | 0.000910 | 0.806187 | 0.003640 | 0.094631 | 0.010009 | 0.016379 |
feature_analysis["age"]
error | count | mean | std | min | 25% | 50% | 75% | max |
---|---|---|---|---|---|---|---|---|
0 | 6901.0 | 37.824518 | 13.926955 | 17.0 | 26.0 | 36.0 | 47.0 | 90.0 |
1 | 1099.0 | 43.348499 | 11.097495 | 18.0 | 35.0 | 42.0 | 50.0 | 90.0 |
feature_analysis["capital-gain"]
error | count | mean | std | min | 25% | 50% | 75% | max |
---|---|---|---|---|---|---|---|---|
0 | 6901.0 | 1265.274743 | 8104.97948 | 0.0 | 0.0 | 0.0 | 0.0 | 99999.0 |
1 | 1099.0 | 184.174704 | 2052.24528 | 0.0 | 0.0 | 0.0 | 0.0 | 41310.0 |
We might benefit from building an ensemble from multiple estimators. However, we’d like the estimators to complement each other, i.e., make different mistakes. This is where PoniardBaseEstimator.get_predictions_similarity can help.
pnd.get_predictions_similarity(on_errors=True)
estimator | LogisticRegression | GaussianNB | KNeighborsClassifier | DecisionTreeClassifier | RandomForestClassifier | HistGradientBoostingClassifier | XGBClassifier | HistGradientBoostingClassifier_tuned | XGBClassifier_tuned |
---|---|---|---|---|---|---|---|---|---|
LogisticRegression | 1.000000 | 0.016189 | 0.578868 | 0.464133 | 0.600827 | 0.688545 | 0.655039 | 0.711309 | 0.690366 |
GaussianNB | 0.016189 | 1.000000 | 0.050041 | 0.065285 | 0.047443 | 0.014467 | 0.017282 | 0.018000 | 0.024393 |
KNeighborsClassifier | 0.578868 | 0.050041 | 1.000000 | 0.480397 | 0.643573 | 0.582190 | 0.562862 | 0.568010 | 0.569996 |
DecisionTreeClassifier | 0.464133 | 0.065285 | 0.480397 | 1.000000 | 0.618275 | 0.529288 | 0.533057 | 0.505370 | 0.496691 |
RandomForestClassifier | 0.600827 | 0.047443 | 0.643573 | 0.618275 | 1.000000 | 0.659699 | 0.642367 | 0.627170 | 0.635312 |
HistGradientBoostingClassifier | 0.688545 | 0.014467 | 0.582190 | 0.529288 | 0.659699 | 1.000000 | 0.846959 | 0.877606 | 0.833977 |
XGBClassifier | 0.655039 | 0.017282 | 0.562862 | 0.533057 | 0.642367 | 0.846959 | 1.000000 | 0.844242 | 0.813421 |
HistGradientBoostingClassifier_tuned | 0.711309 | 0.018000 | 0.568010 | 0.505370 | 0.627170 | 0.877606 | 0.844242 | 1.000000 | 0.839548 |
XGBClassifier_tuned | 0.690366 | 0.024393 | 0.569996 | 0.496691 | 0.635312 | 0.833977 | 0.813421 | 0.839548 | 1.000000 |
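To build intuition for what “making different mistakes” means, the sketch below compares the sets of misclassified rows of a few toy models using the Jaccard index (intersection over union). This is a stand-in measure chosen for illustration and is not necessarily the statistic that get_predictions_similarity computes; the model names and predictions are made up.

```python
from itertools import combinations


def error_overlap(y_true, predictions):
    """Jaccard index between the error sets of every pair of models."""
    error_sets = {
        name: {i for i, (t, p) in enumerate(zip(y_true, preds)) if t != p}
        for name, preds in predictions.items()
    }
    overlap = {}
    for a, b in combinations(error_sets, 2):
        union = error_sets[a] | error_sets[b]
        shared = error_sets[a] & error_sets[b]
        # Two models with no errors at all are treated as identical.
        overlap[(a, b)] = len(shared) / len(union) if union else 1.0
    return overlap


y_true = [1, 0, 1, 0, 1, 0]
predictions = {
    "model_a": [0, 0, 1, 0, 1, 1],  # wrong on rows 0 and 5
    "model_b": [0, 0, 1, 0, 1, 0],  # wrong on row 0
    "model_c": [1, 1, 0, 0, 1, 0],  # wrong on rows 1 and 2
}

overlap = error_overlap(y_true, predictions)
for pair, value in overlap.items():
    print(pair, value)
```

Here model_c shares no errors with the other two, so it would be the more promising candidate to combine with either of them, provided its own performance is acceptable.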
Analyzing the results so far and the similarity table, it looks like including LogisticRegression along with the gradient boosters could be a good idea, so we’ll go ahead and tune LR.
name = "LogisticRegression"
grid = {
    f"{name}__C": np.geomspace(0.1, 100, 200),
    f"{name}__class_weight": [None, "balanced"],
}
pnd.tune_estimator(name, grid=grid, mode="random", n_iter=100)
PoniardClassifier(n_jobs=-1, plugins=[WandBPlugin(project='adult-demo'), PandasProfilingPlugin()])
Next, we build a voting classifier with equal weights.
pnd.build_ensemble(
    estimator_names=[
        "HistGradientBoostingClassifier_tuned",
        "XGBClassifier_tuned",
        "LogisticRegression_tuned",
    ],
    method="voting",
    voting="soft",
)
pnd.fit()
pnd.get_results()
estimator | test_roc_auc | test_accuracy | test_precision | test_recall | test_f1 | fit_time | score_time |
---|---|---|---|---|---|---|---|
HistGradientBoostingClassifier_tuned | 0.917760 | 0.862625 | 0.757019 | 0.631896 | 0.688738 | 0.503392 | 0.034052 |
VotingClassifier | 0.917293 | 0.863500 | 0.765881 | 0.623587 | 0.687299 | 2.131655 | 0.044481 |
XGBClassifier_tuned | 0.916398 | 0.858750 | 0.744218 | 0.629814 | 0.682057 | 1.103070 | 0.022097 |
HistGradientBoostingClassifier | 0.915932 | 0.860000 | 0.750181 | 0.627740 | 0.683455 | 0.943964 | 0.059188 |
XGBClassifier | 0.914350 | 0.859750 | 0.742949 | 0.638133 | 0.686398 | 2.057690 | 0.036659 |
LogisticRegression_tuned | 0.898678 | 0.842875 | 0.717942 | 0.571660 | 0.636357 | 0.131910 | 0.026045 |
LogisticRegression | 0.898562 | 0.843250 | 0.714732 | 0.579968 | 0.640167 | 0.193686 | 0.027372 |
RandomForestClassifier | 0.884420 | 0.835750 | 0.680117 | 0.600215 | 0.637618 | 6.557138 | 0.059342 |
KNeighborsClassifier | 0.842528 | 0.815125 | 0.629923 | 0.562315 | 0.594060 | 0.058123 | 0.229496 |
GaussianNB | 0.808289 | 0.583500 | 0.361538 | 0.931476 | 0.520086 | 0.055644 | 0.024651 |
DecisionTreeClassifier | 0.743712 | 0.802375 | 0.587160 | 0.604880 | 0.595834 | 2.182848 | 0.024007 |
DummyClassifier | 0.500000 | 0.759250 | 0.000000 | 0.000000 | 0.000000 | 0.058244 | 0.028908 |
We could even tune the ensemble weights.
# Build 200 candidate weight vectors that each sum to 1
weights = []
for _ in range(200):
    unscaled = np.random.uniform(0, 1, size=3)
    scaled = unscaled / np.sum(unscaled)
    weights.append(scaled.tolist())
pnd.tune_estimator(
    "VotingClassifier",
    grid={"VotingClassifier__weights": weights},
    mode="random",
    n_iter=10,
)
PoniardClassifier(n_jobs=-1, plugins=[WandBPlugin(project='adult-demo'), PandasProfilingPlugin()])
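It helps to see what the weights actually change. In soft voting, the ensemble averages the class probabilities of its members, and the weights turn that into a weighted average. The sketch below shows the calculation for a single observation in plain Python. The three probabilities are hypothetical, and the weights are rounded versions of the ones that appear in the final VotingClassifier shown later in this guide.

```python
def soft_vote(probas, weights):
    """Weighted average of the members' predicted probabilities for class 1."""
    return sum(w * p for w, p in zip(weights, probas)) / sum(weights)


# Hypothetical P(class 1) from HGB, XGBoost and logistic regression for one row.
probas = [0.30, 0.55, 0.90]

equal = soft_vote(probas, [1, 1, 1])
weighted = soft_vote(probas, [0.596, 0.276, 0.128])

print(round(equal, 4), int(equal >= 0.5))
print(round(weighted, 4), int(weighted >= 0.5))
```

With equal weights the average is above 0.5 and the row is predicted as class 1. With most of the weight on the first model, the average drops below 0.5 and the prediction flips to class 0.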
pnd.fit()
pnd.get_results()
estimator | test_roc_auc | test_accuracy | test_precision | test_recall | test_f1 | fit_time | score_time |
---|---|---|---|---|---|---|---|
VotingClassifier_tuned | 0.918799 | 0.862625 | 0.760029 | 0.627224 | 0.687134 | 2.107462 | 0.050291 |
HistGradientBoostingClassifier_tuned | 0.917760 | 0.862625 | 0.757019 | 0.631896 | 0.688738 | 0.503392 | 0.034052 |
VotingClassifier | 0.917293 | 0.863500 | 0.765881 | 0.623587 | 0.687299 | 2.131655 | 0.044481 |
XGBClassifier_tuned | 0.916398 | 0.858750 | 0.744218 | 0.629814 | 0.682057 | 1.103070 | 0.022097 |
HistGradientBoostingClassifier | 0.915932 | 0.860000 | 0.750181 | 0.627740 | 0.683455 | 0.943964 | 0.059188 |
XGBClassifier | 0.914350 | 0.859750 | 0.742949 | 0.638133 | 0.686398 | 2.057690 | 0.036659 |
LogisticRegression_tuned | 0.898678 | 0.842875 | 0.717942 | 0.571660 | 0.636357 | 0.131910 | 0.026045 |
LogisticRegression | 0.898562 | 0.843250 | 0.714732 | 0.579968 | 0.640167 | 0.193686 | 0.027372 |
RandomForestClassifier | 0.884420 | 0.835750 | 0.680117 | 0.600215 | 0.637618 | 6.557138 | 0.059342 |
KNeighborsClassifier | 0.842528 | 0.815125 | 0.629923 | 0.562315 | 0.594060 | 0.058123 | 0.229496 |
GaussianNB | 0.808289 | 0.583500 | 0.361538 | 0.931476 | 0.520086 | 0.055644 | 0.024651 |
DecisionTreeClassifier | 0.743712 | 0.802375 | 0.587160 | 0.604880 | 0.595834 | 2.182848 | 0.024007 |
DummyClassifier | 0.500000 | 0.759250 | 0.000000 | 0.000000 | 0.000000 | 0.058244 | 0.028908 |
pnd.plot.confusion_matrix("VotingClassifier_tuned", normalize="true")
pnd.plot.permutation_importance("VotingClassifier_tuned")
pnd.plot.partial_dependence("VotingClassifier_tuned", "marital-status")
# Retrain the chosen pipeline on the full training data
final = pnd.get_estimator("VotingClassifier_tuned", retrain=True)
final
Pipeline(steps=[('preprocessor', Pipeline(steps=[('type_preprocessor', ColumnTransformer(transformers=[('numeric_preprocessor', Pipeline(steps=[('numeric_imputer', SimpleImputer()), ('scaler', StandardScaler())]), ['age', 'capital-gain', 'capital-loss', 'hours-per-week']), ('categorical_low_preprocessor', Pipeline(steps=[('categorical_imputer', SimpleImputer(st... predictor='auto', random_state=0, reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1, tree_method='exact', use_label_encoder=False, validate_parameters=1, verbosity=0)), ('LogisticRegression_tuned', LogisticRegression(C=0.3255088599835058, max_iter=5000, random_state=0))], verbose=0, voting='soft', weights=[0.5963099656272408, 0.2756042875483777, 0.1280857468243813]))])
Pipeline(steps=[('type_preprocessor', ColumnTransformer(transformers=[('numeric_preprocessor', Pipeline(steps=[('numeric_imputer', SimpleImputer()), ('scaler', StandardScaler())]), ['age', 'capital-gain', 'capital-loss', 'hours-per-week']), ('categorical_low_preprocessor', Pipeline(steps=[('categorical_imputer', SimpleImputer(strategy='most_frequent')), ('one-h... ['education-num', 'workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race', 'sex']), ('categorical_high_preprocessor', Pipeline(steps=[('categorical_imputer', SimpleImputer(strategy='most_frequent')), ('high_cardinality_encoder', TargetEncoder(handle_unknown='ignore', task='classification'))]), ['native-country'])])), ('remove_invariant', VarianceThreshold())])
ColumnTransformer(transformers=[('numeric_preprocessor', Pipeline(steps=[('numeric_imputer', SimpleImputer()), ('scaler', StandardScaler())]), ['age', 'capital-gain', 'capital-loss', 'hours-per-week']), ('categorical_low_preprocessor', Pipeline(steps=[('categorical_imputer', SimpleImputer(strategy='most_frequent')), ('one-hot_encoder', OneHotEncoder(drop='if_bi... handle_unknown='ignore', sparse=False))]), ['education-num', 'workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race', 'sex']), ('categorical_high_preprocessor', Pipeline(steps=[('categorical_imputer', SimpleImputer(strategy='most_frequent')), ('high_cardinality_encoder', TargetEncoder(handle_unknown='ignore', task='classification'))]), ['native-country'])])
['age', 'capital-gain', 'capital-loss', 'hours-per-week']
SimpleImputer()
StandardScaler()
['education-num', 'workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race', 'sex']
SimpleImputer(strategy='most_frequent')
OneHotEncoder(drop='if_binary', handle_unknown='ignore', sparse=False)
['native-country']
SimpleImputer(strategy='most_frequent')
TargetEncoder(handle_unknown='ignore', task='classification')
VarianceThreshold()
VotingClassifier(estimators=[('HistGradientBoostingClassifier_tuned', HistGradientBoostingClassifier(l2_regularization=0.2, learning_rate=0.2, max_iter=120, max_leaf_nodes=10, random_state=0)), ('XGBClassifier_tuned', XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1, colsample_bynode=1, colsample_bytree=1, enable_categorical=False, gamma=0, gp... predictor='auto', random_state=0, reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1, tree_method='exact', use_label_encoder=False, validate_parameters=1, verbosity=0)), ('LogisticRegression_tuned', LogisticRegression(C=0.3255088599835058, max_iter=5000, random_state=0))], verbose=0, voting='soft', weights=[0.5963099656272408, 0.2756042875483777, 0.1280857468243813])
HistGradientBoostingClassifier(l2_regularization=0.2, learning_rate=0.2, max_iter=120, max_leaf_nodes=10, random_state=0)
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1, colsample_bynode=1, colsample_bytree=1, enable_categorical=False, gamma=0, gpu_id=-1, importance_type=None, interaction_constraints='', learning_rate=0.5, max_delta_step=0, max_depth=4, min_child_weight=7, missing=nan, monotone_constraints='()', n_estimators=100, n_jobs=1, num_parallel_tree=1, predictor='auto', random_state=0, reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1, tree_method='exact', use_label_encoder=False, validate_parameters=1, verbosity=0)
LogisticRegression(C=0.3255088599835058, max_iter=5000, random_state=0)
print(classification_report(y_test, final.predict(X_test), digits=3))
precision recall f1-score support
0 0.896 0.946 0.920 1518
1 0.793 0.654 0.717 482
accuracy 0.875 2000
macro avg 0.845 0.800 0.818 2000
weighted avg 0.871 0.875 0.871 2000
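As a quick sanity check on the report, the F1 scores can be recomputed from the printed precision and recall, and the weighted average from the per-class F1 and support. The sketch below uses only the rounded figures from the report, so small differences in the last digit would not be surprising.

```python
def f1_score_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Rounded values copied from the classification report above.
f1_class_0 = f1_score_from(0.896, 0.946)
f1_class_1 = f1_score_from(0.793, 0.654)

# Weighted average F1, using the support of each class as weights.
support = {0: 1518, 1: 482}
weighted_f1 = (f1_class_0 * support[0] + f1_class_1 * support[1]) / sum(support.values())

print(round(f1_class_0, 3), round(f1_class_1, 3), round(weighted_f1, 3))
```

The recomputed values agree with the report to three decimals, and the gap between the two classes shows that most of the remaining headroom is in recall for class 1.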