import numpy as np
import pandas as pd
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from poniard import PoniardClassifier
from poniard.plugins import WandBPlugin, PandasProfilingPlugin
from poniard.error_analysis import ErrorAnalyzer

A slightly more complex case study
We’ll get the data and sample 10,000 observations to speed up training. We also cast pd.Categorical columns to object, since scikit-learn doesn’t play well with them, drop the “fnlwgt” column (it holds survey weights and shouldn’t provide any relevant information), and encode the target as 1/0.
# Adult Census dataset
X, y = fetch_openml(data_id=1590, return_X_y=True, as_frame=True)
X = X.sample(n=10000, random_state=0).drop("fnlwgt", axis=1)
y = y.reindex(X.index)
category_cols = X.select_dtypes(include="category").columns
X = X.astype({col: object for col in category_cols})
y = y.replace({">50K": 1, "<=50K": 0})

Next we split and pass only the training data to Poniard. We’ll be using the two available plugins.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y)
pnd = PoniardClassifier(
    n_jobs=-1,
    plugins=[WandBPlugin(project="adult-demo"), PandasProfilingPlugin()],
)
pnd.setup(X_train, y_train)
pnd.remove_estimators("SVC")  # Doesn't scale nicely

wandb: Currently logged in as: rxavier. Use `wandb login --relogin` to force relogin
Setup info
Target
Type: binary
Shape: (8000,)
Unique values: 2
Metrics
Main metric: roc_auc
Feature type inference
Minimum unique values to consider a number-like feature numeric: 800
Minimum unique values to consider a categorical feature high cardinality: 20
Inferred feature types:
| | numeric | categorical_high | categorical_low | datetime |
|---|---|---|---|---|
| 0 | | age | education-num | |
| 1 | | capital-gain | workclass | |
| 2 | | capital-loss | education | |
| 3 | | hours-per-week | marital-status | |
| 4 | | native-country | occupation | |
| 5 | | | relationship | |
| 6 | | | race | |
| 7 | | | sex | |
PoniardClassifier(n_jobs=-1, plugins=[WandBPlugin(project='adult-demo'), PandasProfilingPlugin()])
As can be seen, Pandas Profiling already created a report and saved it to the default location. If ipywidgets is installed, the report will be included in the output.
Meanwhile, Weights and Biases either logged in or prompted for a login, and started logging information about the run (preprocessor HTML representation, dataset, inferred types). Because plugins can check whether other plugins are included in a Poniard estimator (via BasePlugin._check_plugin_used), wandb also uploaded the profile report.
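If you want to check programmatically which plugins are attached, the plugin list passed to the constructor can be inspected. This is just a sketch and assumes the constructor argument is stored on a plugins attribute, following the scikit-learn convention shown in the repr above:

# Hypothetical check: names of the plugins attached to this PoniardClassifier
# (assumes the plugins argument is available as pnd.plugins)
[type(plugin).__name__ for plugin in pnd.plugins]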
Right away there are some misclassified features: the setup output shows that a number-like column needs at least 800 unique values to be inferred as numeric, so age, capital-gain, capital-loss and hours-per-week fall below that threshold and were tagged as high-cardinality categoricals instead. We’ll reassign them.

pnd.reassign_types(numeric=["age", "capital-gain", "capital-loss", "hours-per-week"])

Assigned feature types:
| | numeric | categorical_high | categorical_low | datetime |
|---|---|---|---|---|
| 0 | age | native-country | education-num | |
| 1 | capital-gain | | workclass | |
| 2 | capital-loss | | education | |
| 3 | hours-per-week | | marital-status | |
| 4 | | | occupation | |
| 5 | | | relationship | |
| 6 | | | race | |
| 7 | | | sex | |
PoniardClassifier(n_jobs=-1, plugins=[WandBPlugin(project='adult-demo'), PandasProfilingPlugin()])
pnd.fit()
pnd.get_results()
| | test_roc_auc | test_accuracy | test_precision | test_recall | test_f1 | fit_time | score_time |
|---|---|---|---|---|---|---|---|
| HistGradientBoostingClassifier | 0.915932 | 0.860000 | 0.750181 | 0.627740 | 0.683455 | 0.943964 | 0.059188 |
| XGBClassifier | 0.914350 | 0.859750 | 0.742949 | 0.638133 | 0.686398 | 2.057690 | 0.036659 |
| LogisticRegression | 0.898562 | 0.843250 | 0.714732 | 0.579968 | 0.640167 | 0.193686 | 0.027372 |
| RandomForestClassifier | 0.884420 | 0.835750 | 0.680117 | 0.600215 | 0.637618 | 6.557138 | 0.059342 |
| KNeighborsClassifier | 0.842528 | 0.815125 | 0.629923 | 0.562315 | 0.594060 | 0.058123 | 0.229496 |
| GaussianNB | 0.808289 | 0.583500 | 0.361538 | 0.931476 | 0.520086 | 0.055644 | 0.024651 |
| DecisionTreeClassifier | 0.743712 | 0.802375 | 0.587160 | 0.604880 | 0.595834 | 2.182848 | 0.024007 |
| DummyClassifier | 0.500000 | 0.759250 | 0.000000 | 0.000000 | 0.000000 | 0.058244 | 0.028908 |
The results table is nice, but we can also plot the metrics we’re interested in.
Since y is imbalanced, we look at ROC AUC and F1 score as well as accuracy.
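A quick way to confirm the imbalance is to look at the class ratios directly (plain pandas; y_train comes from the split above):

# Share of each class in the training target; roughly 0.76 / 0.24,
# which lines up with the DummyClassifier accuracy in the table above
y_train.value_counts(normalize=True)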
pnd.plot.metrics(metrics=["f1", "roc_auc", "accuracy"], kind="bar")

pnd.plot.roc_curve(estimator_names=["HistGradientBoostingClassifier", "XGBClassifier"])

Performance between HistGradientBoostingClassifier and XGBoost is very similar. We can go ahead and try to squeeze a bit more using hyperparameter tuning with PoniardBaseEstimator.tune_estimator.
Note that we’re building custom parameter grids: the grids included in Poniard are smaller and better suited to a full grid search than to the random search we’ll use here.
name = "HistGradientBoostingClassifier"
grid = {
    f"{name}__learning_rate": np.arange(0.1, 1.1, 0.1),
    f"{name}__max_iter": np.arange(100, 520, 20),
    f"{name}__max_leaf_nodes": np.arange(10, 110, 10),
    f"{name}__l2_regularization": np.arange(0, 1.1, 0.1),
}
pnd.tune_estimator(name, grid=grid, mode="random", n_iter=100)

PoniardClassifier(n_jobs=-1, plugins=[WandBPlugin(project='adult-demo'), PandasProfilingPlugin()])
To keep runtime reasonable, we tune XGBoost for only 50 iterations, since it takes a lot longer to fit than HistGradientBoostingClassifier.
name = "XGBClassifier"
grid = {
    f"{name}__n_estimators": np.arange(100, 520, 20),
    f"{name}__max_depth": np.arange(2, 21, 2),
    f"{name}__learning_rate": np.arange(0.1, 1.1, 0.1),
    f"{name}__min_child_weight": np.arange(1, 11, 1),
}
pnd.tune_estimator(name, grid=grid, mode="random", n_iter=50)

PoniardClassifier(n_jobs=-1, plugins=[WandBPlugin(project='adult-demo'), PandasProfilingPlugin()])
After tuning, the tuned estimators are added to PoniardBaseEstimator.pipelines, but they still need to be fit with CV. Calling PoniardBaseEstimator.fit will run cross validation only for estimators that haven’t been fit yet.
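To see what is registered before refitting, the pipelines attribute can be inspected. A rough sketch, assuming pipelines is a mapping keyed by estimator name:

# List registered estimators; the tuned ones should appear with a "_tuned" suffix
# (assumes PoniardBaseEstimator.pipelines is dict-like, keyed by name)
list(pnd.pipelines)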
pnd.fit()
pnd.get_results()

| | test_roc_auc | test_accuracy | test_precision | test_recall | test_f1 | fit_time | score_time |
|---|---|---|---|---|---|---|---|
| HistGradientBoostingClassifier_tuned | 0.917760 | 0.862625 | 0.757019 | 0.631896 | 0.688738 | 0.503392 | 0.034052 |
| XGBClassifier_tuned | 0.916398 | 0.858750 | 0.744218 | 0.629814 | 0.682057 | 1.103070 | 0.022097 |
| HistGradientBoostingClassifier | 0.915932 | 0.860000 | 0.750181 | 0.627740 | 0.683455 | 0.943964 | 0.059188 |
| XGBClassifier | 0.914350 | 0.859750 | 0.742949 | 0.638133 | 0.686398 | 2.057690 | 0.036659 |
| LogisticRegression | 0.898562 | 0.843250 | 0.714732 | 0.579968 | 0.640167 | 0.193686 | 0.027372 |
| RandomForestClassifier | 0.884420 | 0.835750 | 0.680117 | 0.600215 | 0.637618 | 6.557138 | 0.059342 |
| KNeighborsClassifier | 0.842528 | 0.815125 | 0.629923 | 0.562315 | 0.594060 | 0.058123 | 0.229496 |
| GaussianNB | 0.808289 | 0.583500 | 0.361538 | 0.931476 | 0.520086 | 0.055644 | 0.024651 |
| DecisionTreeClassifier | 0.743712 | 0.802375 | 0.587160 | 0.604880 | 0.595834 | 2.182848 | 0.024007 |
| DummyClassifier | 0.500000 | 0.759250 | 0.000000 | 0.000000 | 0.000000 | 0.058244 | 0.028908 |
From here we can get a summary of a given estimator’s performance with PoniardBaseEstimator.analyze_estimator.
pnd.analyze_estimator("HistGradientBoostingClassifier_tuned")

We can also understand where errors are coming from by leveraging ErrorAnalyzer.
analyzer = ErrorAnalyzer.from_poniard(
    pnd, estimator_names="HistGradientBoostingClassifier_tuned"
)
errors = analyzer.rank_errors()["HistGradientBoostingClassifier_tuned"]
errors["values"]

| | y | prediction | proba_0 | proba_1 | error |
|---|---|---|---|---|---|
| 5953 | 1 | 0 | 0.999105 | 0.000895 | 0.999105 |
| 29128 | 1 | 0 | 0.997760 | 0.002240 | 0.997760 |
| 786 | 0 | 1 | 0.003175 | 0.996825 | 0.996825 |
| 11144 | 1 | 0 | 0.996473 | 0.003527 | 0.996473 |
| 2004 | 1 | 0 | 0.995659 | 0.004341 | 0.995659 |
| ... | ... | ... | ... | ... | ... |
| 37335 | 0 | 1 | 0.498782 | 0.501218 | 0.501218 |
| 17237 | 1 | 0 | 0.501033 | 0.498967 | 0.501033 |
| 29824 | 1 | 0 | 0.500940 | 0.499060 | 0.500940 |
| 18163 | 1 | 0 | 0.500752 | 0.499248 | 0.500752 |
| 35847 | 0 | 1 | 0.499532 | 0.500468 | 0.500468 |
1099 rows × 5 columns

Errors are ranked by how confident the wrong prediction was: the first rows are observations the model misclassified with near-total certainty, while the last ones sit right at the 0.5 decision threshold.
analyzer.analyze_target(errors["idx"], as_ratio=True)

| class | 0_errors | 0_target |
|---|---|---|
| 1 | 0.645132 | 0.24075 |
| 0 | 0.354868 | 0.75925 |
analyzer.analyze_target(errors["idx"], wrt_target=True)

class
1    0.368120
0    0.064208
dtype: float64
As expected, the model’s errors are concentrated on the rare class: roughly 37% of positive-class observations are misclassified, versus about 6% of the negative class.
We can also get a glimpse of how the features differ between errors and non-errors.
feature_analysis = analyzer.analyze_features(errors["idx"])
feature_analysis["marital-status"]

| error | Divorced | Married-AF-spouse | Married-civ-spouse | Married-spouse-absent | Never-married | Separated | Widowed |
|---|---|---|---|---|---|---|---|
| 0 | 0.145921 | 0.000435 | 0.397044 | 0.011013 | 0.373134 | 0.037096 | 0.035357 |
| 1 | 0.068244 | 0.000910 | 0.806187 | 0.003640 | 0.094631 | 0.010009 | 0.016379 |
feature_analysis["age"]

| error | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| 0 | 6901.0 | 37.824518 | 13.926955 | 17.0 | 26.0 | 36.0 | 47.0 | 90.0 |
| 1 | 1099.0 | 43.348499 | 11.097495 | 18.0 | 35.0 | 42.0 | 50.0 | 90.0 |
feature_analysis["capital-gain"]

| error | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| 0 | 6901.0 | 1265.274743 | 8104.97948 | 0.0 | 0.0 | 0.0 | 0.0 | 99999.0 |
| 1 | 1099.0 | 184.174704 | 2052.24528 | 0.0 | 0.0 | 0.0 | 0.0 | 41310.0 |
Compared to correctly classified observations, errors skew older, are far more likely to be married (Married-civ-spouse), and have much lower average capital gains. We might benefit from building an ensemble from multiple estimators. However, we’d like the estimators to complement each other, i.e., make different mistakes. This is where PoniardBaseEstimator.get_predictions_similarity can help.
pnd.get_predictions_similarity(on_errors=True)

| | LogisticRegression | GaussianNB | KNeighborsClassifier | DecisionTreeClassifier | RandomForestClassifier | HistGradientBoostingClassifier | XGBClassifier | HistGradientBoostingClassifier_tuned | XGBClassifier_tuned |
|---|---|---|---|---|---|---|---|---|---|
| LogisticRegression | 1.000000 | 0.016189 | 0.578868 | 0.464133 | 0.600827 | 0.688545 | 0.655039 | 0.711309 | 0.690366 |
| GaussianNB | 0.016189 | 1.000000 | 0.050041 | 0.065285 | 0.047443 | 0.014467 | 0.017282 | 0.018000 | 0.024393 |
| KNeighborsClassifier | 0.578868 | 0.050041 | 1.000000 | 0.480397 | 0.643573 | 0.582190 | 0.562862 | 0.568010 | 0.569996 |
| DecisionTreeClassifier | 0.464133 | 0.065285 | 0.480397 | 1.000000 | 0.618275 | 0.529288 | 0.533057 | 0.505370 | 0.496691 |
| RandomForestClassifier | 0.600827 | 0.047443 | 0.643573 | 0.618275 | 1.000000 | 0.659699 | 0.642367 | 0.627170 | 0.635312 |
| HistGradientBoostingClassifier | 0.688545 | 0.014467 | 0.582190 | 0.529288 | 0.659699 | 1.000000 | 0.846959 | 0.877606 | 0.833977 |
| XGBClassifier | 0.655039 | 0.017282 | 0.562862 | 0.533057 | 0.642367 | 0.846959 | 1.000000 | 0.844242 | 0.813421 |
| HistGradientBoostingClassifier_tuned | 0.711309 | 0.018000 | 0.568010 | 0.505370 | 0.627170 | 0.877606 | 0.844242 | 1.000000 | 0.839548 |
| XGBClassifier_tuned | 0.690366 | 0.024393 | 0.569996 | 0.496691 | 0.635312 | 0.833977 | 0.813421 | 0.839548 | 1.000000 |
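For intuition, this kind of similarity can be approximated by comparing which observations each pair of models gets wrong. The sketch below is not Poniard’s actual implementation; it assumes you already have each model’s cross-validated predictions and the true labels:

def error_agreement(y_true, preds_a, preds_b):
    # Of the observations that at least one model misclassifies,
    # what fraction do both misclassify? (Jaccard index of the two error sets)
    errors_a = np.asarray(preds_a) != np.asarray(y_true)
    errors_b = np.asarray(preds_b) != np.asarray(y_true)
    union = np.logical_or(errors_a, errors_b).sum()
    return 1.0 if union == 0 else np.logical_and(errors_a, errors_b).sum() / union

Pairs with low agreement, like GaussianNB against every other model, make different mistakes and are in principle more attractive ensemble candidates than pairs that fail on the same observations.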
Analyzing the results so far and the similarity table, it looks like including LogisticRegression along with the gradient boosters could be a good idea, so we’ll go ahead and tune LR.
name = "LogisticRegression"
grid = {
    f"{name}__C": np.geomspace(0.1, 100, 200),
    f"{name}__class_weight": [None, "balanced"],
}
pnd.tune_estimator(name, grid=grid, mode="random", n_iter=100)

PoniardClassifier(n_jobs=-1, plugins=[WandBPlugin(project='adult-demo'), PandasProfilingPlugin()])
Next, we build a voting classifier with equal weights.
pnd.build_ensemble(
    estimator_names=[
        "HistGradientBoostingClassifier_tuned",
        "XGBClassifier_tuned",
        "LogisticRegression_tuned",
    ],
    method="voting",
    voting="soft",
)
pnd.fit()
pnd.get_results()

| | test_roc_auc | test_accuracy | test_precision | test_recall | test_f1 | fit_time | score_time |
|---|---|---|---|---|---|---|---|
| HistGradientBoostingClassifier_tuned | 0.917760 | 0.862625 | 0.757019 | 0.631896 | 0.688738 | 0.503392 | 0.034052 |
| VotingClassifier | 0.917293 | 0.863500 | 0.765881 | 0.623587 | 0.687299 | 2.131655 | 0.044481 |
| XGBClassifier_tuned | 0.916398 | 0.858750 | 0.744218 | 0.629814 | 0.682057 | 1.103070 | 0.022097 |
| HistGradientBoostingClassifier | 0.915932 | 0.860000 | 0.750181 | 0.627740 | 0.683455 | 0.943964 | 0.059188 |
| XGBClassifier | 0.914350 | 0.859750 | 0.742949 | 0.638133 | 0.686398 | 2.057690 | 0.036659 |
| LogisticRegression_tuned | 0.898678 | 0.842875 | 0.717942 | 0.571660 | 0.636357 | 0.131910 | 0.026045 |
| LogisticRegression | 0.898562 | 0.843250 | 0.714732 | 0.579968 | 0.640167 | 0.193686 | 0.027372 |
| RandomForestClassifier | 0.884420 | 0.835750 | 0.680117 | 0.600215 | 0.637618 | 6.557138 | 0.059342 |
| KNeighborsClassifier | 0.842528 | 0.815125 | 0.629923 | 0.562315 | 0.594060 | 0.058123 | 0.229496 |
| GaussianNB | 0.808289 | 0.583500 | 0.361538 | 0.931476 | 0.520086 | 0.055644 | 0.024651 |
| DecisionTreeClassifier | 0.743712 | 0.802375 | 0.587160 | 0.604880 | 0.595834 | 2.182848 | 0.024007 |
| DummyClassifier | 0.500000 | 0.759250 | 0.000000 | 0.000000 | 0.000000 | 0.058244 | 0.028908 |
We could even tune the ensemble weights.
# Sample 200 random weight vectors of length 3 that each sum to 1
weights = []
for _ in range(200):
    unscaled = np.random.uniform(0, 1, size=3)
    scaled = unscaled / np.sum(unscaled)
    weights.append(scaled.tolist())

pnd.tune_estimator(
    "VotingClassifier",
    grid={"VotingClassifier__weights": weights},
    mode="random",
    n_iter=10,
)

PoniardClassifier(n_jobs=-1, plugins=[WandBPlugin(project='adult-demo'), PandasProfilingPlugin()])
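As an aside, weight vectors on the simplex can also be drawn directly with NumPy’s Dirichlet sampler, which handles the normalization in a single call:

# Alternative to the loop above: 200 weight vectors of length 3 that each sum to 1
weights = np.random.default_rng(0).dirichlet(np.ones(3), size=200).tolist()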
pnd.fit()
pnd.get_results()

| | test_roc_auc | test_accuracy | test_precision | test_recall | test_f1 | fit_time | score_time |
|---|---|---|---|---|---|---|---|
| VotingClassifier_tuned | 0.918799 | 0.862625 | 0.760029 | 0.627224 | 0.687134 | 2.107462 | 0.050291 |
| HistGradientBoostingClassifier_tuned | 0.917760 | 0.862625 | 0.757019 | 0.631896 | 0.688738 | 0.503392 | 0.034052 |
| VotingClassifier | 0.917293 | 0.863500 | 0.765881 | 0.623587 | 0.687299 | 2.131655 | 0.044481 |
| XGBClassifier_tuned | 0.916398 | 0.858750 | 0.744218 | 0.629814 | 0.682057 | 1.103070 | 0.022097 |
| HistGradientBoostingClassifier | 0.915932 | 0.860000 | 0.750181 | 0.627740 | 0.683455 | 0.943964 | 0.059188 |
| XGBClassifier | 0.914350 | 0.859750 | 0.742949 | 0.638133 | 0.686398 | 2.057690 | 0.036659 |
| LogisticRegression_tuned | 0.898678 | 0.842875 | 0.717942 | 0.571660 | 0.636357 | 0.131910 | 0.026045 |
| LogisticRegression | 0.898562 | 0.843250 | 0.714732 | 0.579968 | 0.640167 | 0.193686 | 0.027372 |
| RandomForestClassifier | 0.884420 | 0.835750 | 0.680117 | 0.600215 | 0.637618 | 6.557138 | 0.059342 |
| KNeighborsClassifier | 0.842528 | 0.815125 | 0.629923 | 0.562315 | 0.594060 | 0.058123 | 0.229496 |
| GaussianNB | 0.808289 | 0.583500 | 0.361538 | 0.931476 | 0.520086 | 0.055644 | 0.024651 |
| DecisionTreeClassifier | 0.743712 | 0.802375 | 0.587160 | 0.604880 | 0.595834 | 2.182848 | 0.024007 |
| DummyClassifier | 0.500000 | 0.759250 | 0.000000 | 0.000000 | 0.000000 | 0.058244 | 0.028908 |
pnd.plot.confusion_matrix("VotingClassifier_tuned", normalize="true")

pnd.plot.permutation_importance("VotingClassifier_tuned")

pnd.plot.partial_dependence("VotingClassifier_tuned", "marital-status")

final = pnd.get_estimator("VotingClassifier_tuned", retrain=True)
final

Pipeline(steps=[('preprocessor',
Pipeline(steps=[('type_preprocessor',
ColumnTransformer(transformers=[('numeric_preprocessor',
Pipeline(steps=[('numeric_imputer',
SimpleImputer()),
('scaler',
StandardScaler())]),
['age',
'capital-gain',
'capital-loss',
'hours-per-week']),
('categorical_low_preprocessor',
Pipeline(steps=[('categorical_imputer',
SimpleImputer(st...
predictor='auto',
random_state=0,
reg_alpha=0,
reg_lambda=1,
scale_pos_weight=1,
subsample=1,
tree_method='exact',
use_label_encoder=False,
validate_parameters=1,
verbosity=0)),
('LogisticRegression_tuned',
LogisticRegression(C=0.3255088599835058,
max_iter=5000,
random_state=0))],
verbose=0, voting='soft',
weights=[0.5963099656272408,
0.2756042875483777,
0.1280857468243813]))])
print(classification_report(y_test, final.predict(X_test), digits=3))

              precision    recall  f1-score   support

           0      0.896     0.946     0.920      1518
           1      0.793     0.654     0.717       482

    accuracy                          0.875      2000
   macro avg      0.845     0.800     0.818      2000
weighted avg      0.871     0.875     0.871      2000

Held-out performance is in line with the cross validation estimates, so the ensemble doesn’t seem to be overfitting to the CV folds.
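Since ROC AUC was the main metric throughout, it’s worth checking it on the held-out set as well. This uses plain scikit-learn; the soft-voting ensemble exposes predict_proba:

from sklearn.metrics import roc_auc_score

# Probability of the positive class from the retrained final pipeline
roc_auc_score(y_test, final.predict_proba(X_test)[:, 1])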