XGBoost: A Visual Guide

From Decision Trees to Gradient Boosted Ensembles

Dave Liu

01 / Foundation
Decision Trees
A decision tree partitions feature space by asking a sequence of yes/no questions. Each internal node tests a feature, each branch follows an outcome, and each leaf assigns a prediction. Let's see one in action.

Interactive Decision Tree on 2D Data

The tree splits the space to separate two classes. Hover over nodes to see split conditions. The scatter plot on the right shows the resulting decision boundary.


The scatter plot shows two classes (green = positive, red = negative) in 2D feature space. Dashed lines show where the decision tree splits the space. Each rectangular region is assigned to whichever class is more common within it.

from sklearn.tree import DecisionTreeClassifier

# Limit depth so the tree stays interpretable and doesn't overfit
tree = DecisionTreeClassifier(max_depth=3)
tree.fit(X_train, y_train)
predictions = tree.predict(X_test)  # each sample follows the splits down to a leaf
02 / Better Together
Ensemble Methods: Why One Tree Isn't Enough
A single decision tree is prone to overfitting. Ensemble methods combine many weak learners into a strong one. The key insight: aggregate many rough guesses and the errors cancel out.

Single Tree vs. Random Forest vs. Boosting

Watch how the decision boundary changes as we move from a single tree to an ensemble approach. Click each tab to compare.

Single Tree: High variance, jagged boundaries. Prone to overfitting the noise in training data.

Compare how decision boundaries evolve. A single tree creates blocky regions. Random Forest averages many trees for smoother boundaries. Boosting sequentially corrects errors, creating increasingly refined boundaries. Use the slider to add trees.
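
To reproduce this comparison outside the demo, here is a minimal sketch using scikit-learn and a synthetic two-class dataset (the data and settings are illustrative, not the demo's):

from sklearn.datasets import make_moons
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic 2D data standing in for the demo's two classes
X, y = make_moons(n_samples=400, noise=0.3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

models = {
    "single tree": DecisionTreeClassifier(max_depth=None),
    "random forest": RandomForestClassifier(n_estimators=200),
    "boosting": GradientBoostingClassifier(n_estimators=200, learning_rate=0.1),
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    print(f"{name}: test accuracy = {model.score(X_te, y_te):.3f}")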

Bias-Variance Tradeoff

Expected prediction error decomposes into bias² + variance + irreducible noise. As model complexity increases, bias drops but variance rises; the sweet spot minimizes the total error.

Adding Trees One at a Time

Use the slider to add trees to the ensemble and watch the prediction boundary become smoother and more accurate.

03 / The Core Algorithm
Gradient Boosting Step by Step
Gradient boosting builds trees sequentially. Each new tree corrects the errors of all previous trees combined. This section walks you through each iteration on a simple 1D regression problem.

Boosting Rounds: Watch Residuals Shrink

Start with a constant prediction (the mean), then iteratively fit trees to the residuals. Each tree nudges the prediction closer to the truth.

Panels: Prediction vs Actual (left) · Residuals (right)

Round 0: we start with the simplest model, predicting the mean of all y values. The residuals (actual − predicted) show how far off we are.

Left chart: blue dots are actual data points, the green line is the model's current prediction. Right chart: residuals (errors) as bars — positive bars mean underprediction, negative bars mean overprediction. Use the slider or Play button to watch residuals shrink as boosting rounds are added.

The Algorithm in Steps

1. Initialize: start with F₀(x) = mean(y). This is our base prediction.
2. Compute residuals: rᵢ = yᵢ − Fₘ₋₁(xᵢ). These are the negative gradients of the squared-error loss.
3. Fit a tree to the residuals: train a shallow decision tree hₘ(x) that predicts the residuals.
4. Update the model: Fₘ(x) = Fₘ₋₁(x) + η·hₘ(x), where η is the learning rate.
5. Repeat: go back to step 2. Each round the residuals get smaller (a from-scratch sketch of this loop appears below).
import xgboost as xgb

model = xgb.XGBRegressor(
    n_estimators=100,    # number of boosting rounds (trees)
    learning_rate=0.3,   # shrinkage applied to each tree's contribution
    max_depth=3          # depth of each individual tree
)
model.fit(X_train, y_train)
preds = model.predict(X_test)
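
The five steps above map directly to a short loop. Below is a minimal from-scratch sketch on a synthetic 1D dataset, with scikit-learn trees standing in for XGBoost's internals (the names and data are illustrative, not XGBoost's implementation):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boost(X, y, n_rounds=10, learning_rate=0.3, max_depth=2):
    prediction = np.full(len(y), y.mean())            # step 1: initialize with the mean
    trees = []
    for _ in range(n_rounds):
        residuals = y - prediction                    # step 2: negative gradients for squared error
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residuals)                        # step 3: fit a shallow tree to the residuals
        prediction += learning_rate * tree.predict(X) # step 4: scaled update
        trees.append(tree)                            # step 5: repeat
    return y.mean(), trees

def predict(X_new, base, trees, learning_rate=0.3):
    pred = np.full(len(X_new), base)
    for tree in trees:
        pred += learning_rate * tree.predict(X_new)
    return pred

# Toy 1D regression problem
rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 10, 80)).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(0, 0.2, 80)
base, trees = boost(X, y)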
04 / Tuning
Learning Rate and Regularization
The learning rate controls how much each tree contributes. A lower rate requires more trees but often generalizes better. Think of it as taking smaller, more careful steps toward the solution.

Learning Rate Comparison

Compare how different learning rates converge. Low rates (0.05) need many trees. High rates (1.0) converge fast but can overshoot.

Curves shown: lr = 1.0, lr = 0.3, lr = 0.1, lr = 0.05

Each colored line shows how training loss decreases with boosting rounds at a different learning rate. Lower learning rates (blue, green) converge more slowly but reach better final loss. Higher rates (red) converge fast but plateau early or overfit.

# Lower learning rate + more trees = better generalization
model_careful = xgb.XGBRegressor(learning_rate=0.05, n_estimators=500)
model_fast    = xgb.XGBRegressor(learning_rate=1.0,  n_estimators=50)
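
To see the tradeoff numerically, both configurations can be fit and scored on a held-out set (X_val and y_val are assumed to exist, as in the early-stopping example later):

from sklearn.metrics import mean_squared_error

for name, m in [("careful (lr=0.05)", model_careful), ("fast (lr=1.0)", model_fast)]:
    m.fit(X_train, y_train)
    mse = mean_squared_error(y_val, m.predict(X_val))
    print(f"{name}: validation MSE = {mse:.4f}")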
05 / What Makes XGBoost Special
XGBoost-Specific Innovations
XGBoost isn't just gradient boosting. It adds regularization, approximate algorithms, sparsity awareness, and engineering optimizations that make it fast and hard to overfit.

Regularized Objective

XGBoost adds L1 and L2 penalties on the leaf weights, shrinking them toward zero to prevent overfitting.

Obj = Loss + γ·T + ½·λ·∑ wⱼ² + α·∑ |wⱼ|
T = number of leaves, wⱼ = leaf weights
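
One consequence of this objective (from the XGBoost paper's derivation, stated here without the algebra): for a fixed tree structure, the optimal weight of leaf j is wⱼ* = −Gⱼ / (Hⱼ + λ), where Gⱼ and Hⱼ sum the loss gradients and Hessians of the samples in that leaf. Larger λ therefore shrinks every leaf weight toward zero, while γ charges a fixed cost per leaf, so splits whose gain doesn't cover γ are pruned away.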

Histogram-Based Split Finding

Instead of testing every unique value, XGBoost buckets continuous features into quantiles and tests bucket boundaries.

Approximate algorithm: per feature, split finding scans bin boundaries, O(#bins), rather than every sorted value, O(#data). Typically 256 bins suffice.
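
A minimal sketch of the bucketing idea in plain NumPy (this is not XGBoost's weighted quantile sketch, just the intuition):

import numpy as np

def quantile_bins(feature_values, n_bins=256):
    # Bin edges at the feature's quantiles; candidate splits are the edges,
    # not every unique value
    edges = np.unique(np.quantile(feature_values, np.linspace(0, 1, n_bins + 1)[1:-1]))
    bin_ids = np.searchsorted(edges, feature_values)
    return edges, bin_ids

# Hypothetical feature with 100,000 rows: only ~255 candidate splits remain
x = np.random.default_rng(0).normal(size=100_000)
edges, bins = quantile_bins(x)
print(len(edges), "candidate split points")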

Sparsity-Aware Splits

Missing values are routed to the child that reduces loss most. XGBoost learns the optimal default direction during training.

Default direction: For each split, XGBoost tests sending missing values left vs right and picks the better option.
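
Because of this, NaN can be passed straight into training with no imputation step. A small illustrative sketch (synthetic data, assumed settings):

import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
X[rng.random(X.shape) < 0.2] = np.nan   # knock out 20% of the entries

# NaNs follow each split's learned default direction
clf = xgb.XGBClassifier(n_estimators=50, max_depth=3)
clf.fit(X, y)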

Column Subsampling

Like random forests, XGBoost can randomly select a subset of features for each tree or split, reducing correlation between trees.

model = xgb.XGBRegressor(
    reg_lambda=1.0,        # L2 regularization
    reg_alpha=0.0,         # L1 regularization
    colsample_bytree=0.8,  # 80% of features per tree
    tree_method='hist',    # histogram-based splitting
    max_bin=256,           # number of bins for histograms
)
06 / Knowing When to Stop
Overfitting and Early Stopping
More trees always reduce training loss, but at some point validation loss starts rising. Early stopping monitors validation performance and halts when it stops improving.

Training vs Validation Loss

Drag the slider to set the number of boosting rounds. The vertical line shows where early stopping would kick in.

Early stopping at round 65: Validation loss reached its minimum here. Continuing beyond this point means memorizing noise.

Blue line = training loss (always decreasing). Red line = validation loss (decreases then increases). The vertical dashed line marks the optimal stopping point where validation loss is minimized. Beyond this point, the model memorizes training data.

model = xgb.XGBRegressor(n_estimators=1000, early_stopping_rounds=10)
model.fit(
    X_train, y_train,
    eval_set=[(X_val, y_val)],
    verbose=False
)
# model.best_iteration gives the optimal round
07 / Interpretation
Feature Importance
XGBoost provides three measures of feature importance: gain (improvement in loss), cover (number of samples affected), and frequency (number of times a feature is used in splits).

House Price Prediction: Feature Importance

Toggle between different importance metrics. Gain and frequency can tell very different stories.

Horizontal bars show each feature's importance. Toggle between Gain (how much a feature improves predictions when used), Cover (how many data points a feature affects), and Frequency (how often a feature appears in trees). These can rank features very differently.

model.fit(X_train, y_train)

# Three types of importance
xgb.plot_importance(model, importance_type='gain')
xgb.plot_importance(model, importance_type='cover')
xgb.plot_importance(model, importance_type='weight')  # frequency
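
The same scores can also be read programmatically from the underlying Booster (keys default to f0, f1, ... when no feature names are set):

scores = model.get_booster().get_score(importance_type='gain')
for feature, gain in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{feature}: gain = {gain:.2f}")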
08 / Reference
Hyperparameter Cheat Sheet
The most important hyperparameters, what they control, and sensible defaults. Tuning these is where the art of gradient boosting meets the science.
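
As an illustrative starting point before tuning (typical values seen in practice, not universal defaults):

model = xgb.XGBRegressor(
    n_estimators=1000,         # rely on early stopping to choose the round
    learning_rate=0.05,        # small steps; pair with more trees
    max_depth=6,               # 3-8 is a common search range
    min_child_weight=1,        # larger values make splits more conservative
    subsample=0.8,             # row subsampling per tree
    colsample_bytree=0.8,      # column subsampling per tree
    reg_lambda=1.0,            # L2 on leaf weights
    early_stopping_rounds=50,  # needs eval_set=[(X_val, y_val)] in fit()
)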
09 / Landscape
XGBoost vs LightGBM vs CatBoost
Three dominant gradient boosting libraries. They share the core algorithm but differ in tree growth strategy, categorical handling, and performance characteristics.

Tree Growth: Level-wise vs Leaf-wise

XGBoost grows trees level by level (breadth-first). LightGBM grows the leaf with the highest loss reduction (depth-first). CatBoost uses symmetric (oblivious) trees.

Level-wise (XGBoost)
Leaf-wise (LightGBM)
Symmetric (CatBoost)

Level-wise (XGBoost) grows all nodes at the same depth before going deeper — balanced but potentially wasteful. Leaf-wise (LightGBM) always splits the leaf with highest loss reduction — faster but risks overfitting. Symmetric (CatBoost) forces identical splits at each level — naturally regularized.
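
XGBoost itself can also grow leaf-wise: with the histogram tree method, grow_policy='lossguide' splits the highest-gain leaf first, much like LightGBM (the parameter values below are illustrative):

# Level-wise growth (the default, depthwise)
xgb_levelwise = xgb.XGBRegressor(tree_method='hist', grow_policy='depthwise', max_depth=6)

# Leaf-wise growth, LightGBM-style
xgb_leafwise = xgb.XGBRegressor(
    tree_method='hist',
    grow_policy='lossguide',
    max_leaves=31,   # cap the number of leaves instead of the depth
    max_depth=0,     # 0 = no depth limit
)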

Feature Comparison

Feature              | XGBoost            | LightGBM               | CatBoost
Tree Growth          | Level-wise         | Leaf-wise              | Symmetric
Speed (large data)   | Fast               | Fastest                | Fast
Categorical Features | Manual encoding    | Native (optimal split) | Native (ordered TS)
Missing Values       | Learned direction  | Native                 | Native
GPU Support          | Yes                | Yes                    | Yes
Regularization       | L1 + L2 on weights | L1 + L2 on weights     | L2 + random permutations
Overfitting Risk     | Medium             | Higher (deep trees)    | Lower
Best For             | General purpose    | Large datasets         | Categorical-heavy data

Approximate Training Speed

Relative training time on a large tabular dataset (lower is better).