From Decision Trees to Gradient Boosted Ensembles
The tree splits the space to separate two classes. Hover over nodes to see split conditions. The scatter plot on the right shows the resulting decision boundary.
The scatter plot shows two classes (green = positive, red = negative) in 2D feature space. Dashed lines show where the decision tree splits the space. Each rectangular region is assigned to whichever class is more common within it.
```python
from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(max_depth=3)
tree.fit(X_train, y_train)
predictions = tree.predict(X_test)
```
Watch how the decision boundary changes as we move from a single tree to an ensemble approach. Click each tab to compare.
Compare how decision boundaries evolve. A single tree creates blocky regions. Random Forest averages many trees for smoother boundaries. Boosting sequentially corrects errors, creating increasingly refined boundaries. Use the slider to add trees.
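To try the same comparison offline, here is a minimal sketch, assuming the same toy `X_train`/`y_train`/`X_test`/`y_test` split used in the earlier snippet (hypothetical names):

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

# X_train, y_train assumed: a small 2-D classification dataset like the scatter plot above
models = {
    "single_tree":   DecisionTreeClassifier(max_depth=3),
    "random_forest": RandomForestClassifier(n_estimators=100),                      # averages many trees
    "boosting":      GradientBoostingClassifier(n_estimators=100, learning_rate=0.1),  # sequential error correction
}
for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, model.score(X_test, y_test))
```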
As model complexity increases, bias drops but variance rises. The sweet spot minimizes total error.
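The "total error" in the chart is the standard squared-error decomposition (a general result, not specific to trees):

$$\mathbb{E}\big[(y - \hat{f}(x))^2\big] = \mathrm{Bias}\big[\hat{f}(x)\big]^2 + \mathrm{Var}\big[\hat{f}(x)\big] + \sigma^2$$

where σ² is irreducible noise. Bagging (Random Forest) mainly attacks the variance term; boosting mainly attacks the bias term, one tree at a time.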
Use the slider to add trees to the ensemble and watch the prediction boundary become smoother and more accurate.
Start with a constant prediction (the mean), then iteratively fit trees to the residuals. Each tree nudges the prediction closer to the truth.
Left chart: blue dots are actual data points, the green line is the model's current prediction. Right chart: residuals (errors) as bars — positive bars mean underprediction, negative bars mean overprediction. Use the slider or Play button to watch residuals shrink as boosting rounds are added.
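Before reaching for a library, here is the loop itself in a few lines: a simplified sketch for squared error, assuming a 2-D feature array `X` and target array `y` (hypothetical names), not XGBoost's actual implementation:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

learning_rate = 0.3
prediction = np.full(len(y), y.mean())   # round 0: constant prediction (the mean)
trees = []

for round_ in range(100):
    residuals = y - prediction                      # what the current model gets wrong
    tree = DecisionTreeRegressor(max_depth=3)
    tree.fit(X, residuals)                          # fit the next tree to the residuals
    prediction += learning_rate * tree.predict(X)   # nudge predictions toward the truth
    trees.append(tree)
```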
```python
import xgboost as xgb

model = xgb.XGBRegressor(
    n_estimators=100,
    learning_rate=0.3,
    max_depth=3,
)
model.fit(X_train, y_train)
preds = model.predict(X_test)
```
Compare how different learning rates converge. Low rates (0.05) need many trees. High rates (1.0) converge fast but can overshoot.
Each colored line shows how training loss decreases with boosting rounds at a different learning rate. Lower learning rates (blue, green) converge more slowly but reach better final loss. Higher rates (red) converge fast but plateau early or overfit.
```python
# Lower learning rate + more trees = better generalization
model_careful = xgb.XGBRegressor(learning_rate=0.05, n_estimators=500)
model_fast = xgb.XGBRegressor(learning_rate=1.0, n_estimators=50)
```
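To reproduce the convergence curves yourself, one option (a sketch, reusing `X_train`/`y_train` and the `xgb` import from above) is to log the per-round training loss via `eval_set` and read it back with `evals_result()`:

```python
curves = {}
for lr in [0.05, 0.3, 1.0]:
    m = xgb.XGBRegressor(learning_rate=lr, n_estimators=200)
    m.fit(X_train, y_train, eval_set=[(X_train, y_train)], verbose=False)
    # evals_result() returns {'validation_0': {'rmse': [loss at round 1, round 2, ...]}}
    curves[lr] = m.evals_result()["validation_0"]["rmse"]
```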
XGBoost adds L1 and L2 penalties on the leaf weights, shrinking them toward zero to prevent overfitting.
Instead of testing every unique value, XGBoost buckets continuous features into quantiles and tests bucket boundaries.
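A toy illustration of the idea with NumPy (not XGBoost's internal code): only the quantile edges are tried as split thresholds, not every unique value.

```python
import numpy as np

feature = np.random.randn(100_000)   # one continuous feature
n_bins = 256
# candidate thresholds = interior quantile edges, not every unique value
candidate_splits = np.quantile(feature, np.linspace(0, 1, n_bins + 1)[1:-1])
print(len(np.unique(feature)), "unique values ->", len(candidate_splits), "candidate splits")
```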
Missing values are routed to the child that reduces loss most. XGBoost learns the optimal default direction during training.
Like random forests, XGBoost can randomly select a subset of features for each tree or split, reducing correlation between trees.
```python
model = xgb.XGBRegressor(
    reg_lambda=1.0,        # L2 regularization
    reg_alpha=0.0,         # L1 regularization
    colsample_bytree=0.8,  # 80% of features per tree
    tree_method='hist',    # histogram-based splitting
    max_bin=256,           # number of bins for histograms
)
```
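In the notation of the XGBoost paper, `reg_lambda` and `reg_alpha` above correspond (roughly) to λ and α in the per-tree penalty over a tree with T leaves and leaf weights w_j; γ is set by the separate `gamma` parameter and penalizes the number of leaves:

$$\Omega(f) = \gamma T + \tfrac{1}{2}\lambda \sum_{j=1}^{T} w_j^2 + \alpha \sum_{j=1}^{T} |w_j|$$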
Drag the slider to set the number of boosting rounds. The vertical line shows where early stopping would kick in.
Blue line = training loss (always decreasing). Red line = validation loss (decreases then increases). The vertical dashed line marks the optimal stopping point where validation loss is minimized. Beyond this point, the model memorizes training data.
```python
model = xgb.XGBRegressor(n_estimators=1000, early_stopping_rounds=10)
model.fit(
    X_train, y_train,
    eval_set=[(X_val, y_val)],
    verbose=False,
)
# model.best_iteration gives the optimal round
```
Toggle between different importance metrics. Gain and frequency can tell very different stories.
Horizontal bars show each feature's importance. Toggle between Gain (how much a feature improves predictions when used), Cover (how many data points a feature affects), and Frequency (how often a feature appears in trees). These can rank features very differently.
```python
model.fit(X_train, y_train)

# Three types of importance
xgb.plot_importance(model, importance_type='gain')
xgb.plot_importance(model, importance_type='cover')
xgb.plot_importance(model, importance_type='weight')  # frequency
```
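The same numbers can also be pulled out programmatically rather than plotted — a sketch using the underlying booster (features with zero importance may be missing from the returned dict):

```python
booster = model.get_booster()
for metric in ["gain", "cover", "weight"]:
    scores = booster.get_score(importance_type=metric)   # {feature_name: score}
    top = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:5]
    print(metric, top)
```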
XGBoost grows trees level by level (breadth-first). LightGBM grows the leaf with the highest loss reduction (best-first). CatBoost uses symmetric (oblivious) trees.
Level-wise (XGBoost) grows all nodes at the same depth before going deeper — balanced but potentially wasteful. Leaf-wise (LightGBM) always splits the leaf with highest loss reduction — faster but risks overfitting. Symmetric (CatBoost) forces identical splits at each level — naturally regularized.
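Roughly equivalent starting points in the three libraries, as a sketch using each library's scikit-learn-style API (hypothetical data names):

```python
import xgboost as xgb
import lightgbm as lgb
from catboost import CatBoostRegressor

xgb_model = xgb.XGBRegressor(max_depth=6, n_estimators=500, learning_rate=0.1)        # level-wise growth
lgb_model = lgb.LGBMRegressor(num_leaves=31, n_estimators=500, learning_rate=0.1)     # leaf-wise growth
cat_model = CatBoostRegressor(depth=6, iterations=500, learning_rate=0.1, verbose=False)  # symmetric trees

# CatBoost handles categorical columns natively:
# cat_model.fit(X_train, y_train, cat_features=categorical_column_indices)
```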
| Feature | XGBoost | LightGBM | CatBoost |
|---|---|---|---|
| Tree Growth | Level-wise | Leaf-wise | Symmetric |
| Speed (large data) | Fast | Fastest | Fast |
| Categorical Features | Manual encoding | Native (optimal split) | Native (ordered target statistics) |
| Missing Values | Learned direction | Native | Native |
| GPU Support | Yes | Yes | Yes |
| Regularization | L1 + L2 on weights | L1 + L2 on weights | L2 + random permutations |
| Overfitting Risk | Medium | Higher (deep trees) | Lower |
| Best For | General purpose | Large datasets | Categorical-heavy data |
Relative training time on a large tabular dataset (lower is better).