Tree models

Intuition for Tree Models

Tree-based models split the feature space into regions using a sequence of yes/no questions.
At each internal node, a decision rule such as “(x_j \leq t)?” sends samples left or right, until we reach a leaf node that holds a prediction.

  • Classification trees: leaves store a class label or class probabilities.
  • Regression trees: leaves store a numeric value, usually the average target in that region.

This makes trees easy to interpret: you can read the path from root to leaf as a human-readable decision rule.


How Decision Trees Are Built

Given training data ((X, y)), a tree is built top‑down:

  1. Choose a split on some feature (x_j) at threshold (t).
  2. Evaluate impurity (how “mixed” the labels are) before and after the split.
  3. Pick the split that maximizes impurity reduction (or loss reduction).
  4. Recurse on the left and right subsets until a stopping rule is met (max depth, min samples per leaf, etc.).
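Steps 1–3 for a single node can be sketched as an exhaustive scan over features and thresholds. This is a minimal illustration, not how scikit-learn implements it; here Gini impurity (defined in the next section) stands in for the generic impurity measure.

```python
import numpy as np

def gini(y):
    """Gini impurity of a label array y."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return float(np.sum(p * (1 - p)))

def best_split(X, y):
    """Scan every feature j and threshold t (steps 1-3) and
    return the split with the largest weighted impurity reduction."""
    n, d = X.shape
    parent = gini(y)
    best_j, best_t, best_gain = None, None, 0.0
    for j in range(d):
        for t in np.unique(X[:, j]):
            left, right = y[X[:, j] <= t], y[X[:, j] > t]
            if len(left) == 0 or len(right) == 0:
                continue  # degenerate split: one side is empty
            child = (len(left) * gini(left) + len(right) * gini(right)) / n
            if parent - child > best_gain:
                best_j, best_t, best_gain = j, t, parent - child
    return best_j, best_t, best_gain

# Toy data: one feature, classes separate cleanly at x <= 2.
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([0, 0, 1, 1])
print(best_split(X, y))  # (0, 2.0, 0.5): both children become pure
```

Step 4 (the recursion) would then call `best_split` again on the left and right subsets until a stopping rule fires.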

Common Impurity Measures (Classification)

  • Gini impurity for a node:

[ G = \sum_{k} p_k (1 - p_k) ]

where (p_k) is the fraction of class (k) in the node.

  • Entropy:

[ H = -\sum_{k} p_k \log_2(p_k) ]

The best split is the one that reduces impurity the most.
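Both measures can be computed directly from the class proportions (p_k); a quick numeric sketch:

```python
import numpy as np

def gini(p):
    """Gini impurity from class proportions p_k."""
    p = np.asarray(p, dtype=float)
    return float(np.sum(p * (1 - p)))

def entropy(p):
    """Entropy (base 2) from class proportions p_k."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]  # convention: 0 * log(0) = 0
    return float(-np.sum(p * np.log2(p)))

# A pure node has impurity 0; a 50/50 node is maximally impure.
print(gini([1.0, 0.0]), entropy([1.0, 0.0]))  # 0.0 0.0
print(gini([0.5, 0.5]), entropy([0.5, 0.5]))  # 0.5 1.0
```

In practice the two measures usually pick very similar splits; Gini is marginally cheaper since it avoids the logarithm.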

Loss for Regression Trees

For regression, impurity is usually the Mean Squared Error (MSE) within each node:

[ \text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \bar{y})^2 ]

where (\bar{y}) is the mean target value in that node; this mean is also the leaf's prediction, since it is the constant that minimizes squared error.
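A short numeric sketch: the node MSE above is just the variance of the targets in the node, and a split's quality is the drop from the parent's MSE to the weighted MSE of its children.

```python
import numpy as np

def node_mse(y):
    """MSE of a node when predicting its mean (equals the variance of y)."""
    return float(np.mean((y - y.mean()) ** 2))

y = np.array([1.0, 2.0, 3.0, 10.0])
print(node_mse(y))  # 12.5, identical to np.var(y)

# Candidate split: {1, 2, 3} vs {10}.
left, right = y[:3], y[3:]
weighted = (len(left) * node_mse(left) + len(right) * node_mse(right)) / len(y)
print(node_mse(y) - weighted)  # impurity reduction achieved by the split
```

Isolating the outlier 10 removes most of the node's variance, so this split scores a large reduction.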


Overfitting and Tree Complexity

Unconstrained trees tend to overfit: they keep splitting until each leaf contains only a few (or even one) training samples.

Typical controls:

  • Max depth: limit how deep the tree can grow.
  • Min samples per split/leaf: require a minimum number of samples to create a new split or leaf.
  • Max features: when searching for the best split, consider only a subset of features.
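These controls map directly onto scikit-learn constructor parameters. A sketch comparing an unconstrained tree with a constrained one (the particular values here are illustrative, not recommendations):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Unconstrained: keeps splitting until every leaf is pure,
# so training accuracy reaches 100%.
deep = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Constrained: the controls listed above.
shallow = DecisionTreeClassifier(
    max_depth=3,          # limit depth
    min_samples_leaf=5,   # at least 5 samples in every leaf
    random_state=0,
).fit(X_train, y_train)

print("deep    train/test:", deep.score(X_train, y_train), deep.score(X_test, y_test))
print("shallow train/test:", shallow.score(X_train, y_train), shallow.score(X_test, y_test))
```

The gap between train and test accuracy is what the constraints are meant to shrink.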

Another approach is post‑pruning:

  1. Grow a large tree.
  2. Iteratively remove branches that give only a tiny improvement in validation error.
  3. Choose the smallest tree with good performance (bias–variance trade‑off).
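scikit-learn implements this idea as cost-complexity pruning. A sketch of the three steps above, where the 0.01 tolerance for "good performance" is an illustrative choice:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# Step 1: grow a large tree and compute its pruning path.
full = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
path = full.cost_complexity_pruning_path(X_train, y_train)

# Step 2: refit one pruned tree per alpha and score each on validation data.
trees = [
    DecisionTreeClassifier(ccp_alpha=a, random_state=0).fit(X_train, y_train)
    for a in path.ccp_alphas
]
scores = [t.score(X_val, y_val) for t in trees]

# Step 3: keep the smallest tree within a small tolerance of the best score.
best = max(scores)
smallest = min(
    (t for t, s in zip(trees, scores) if s >= best - 0.01),
    key=lambda t: t.tree_.node_count,
)
print("nodes:", smallest.tree_.node_count, "val acc:", smallest.score(X_val, y_val))
```

Larger `ccp_alpha` values prune more aggressively, tracing out the bias–variance trade-off mentioned above.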

Pros and Cons of Decision Trees

Advantages

  • Easy to interpret and visualize.
  • Handle non‑linear relationships and feature interactions naturally.
  • Work with mixed data types (numerical + categorical with proper encoding).
  • Require no feature scaling, since each split compares a single feature against a threshold.

Disadvantages

  • Individual trees have high variance: small changes in the training data can produce a very different tree.
  • Splits are axis‑aligned only, which may be suboptimal for problems with oblique decision boundaries.
  • Often outperformed by ensembles of trees.

Tree Ensembles (Brief Overview)

To improve performance and stability, we often combine many trees:

  • Random Forests:
    • Train many trees on bootstrapped samples of the data.
    • At each split, consider only a random subset of features.
    • Predictions are averaged (regression) or voted (classification).
    • Great baseline with good performance and robustness.
  • Gradient Boosted Trees (e.g., XGBoost, LightGBM, CatBoost):
    • Build trees sequentially, each new tree fits the residuals/errors of the previous ones.
    • Very strong performance on structured/tabular data, at the cost of more tuning.
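Both ensemble families are available directly in scikit-learn. A quick comparative sketch (the hyperparameter values are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# Random forest: many trees on bootstrap samples,
# with a random feature subset considered at each split.
rf = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)

# Gradient boosting: shallow trees fit sequentially to the
# errors of the ensemble built so far.
gb = GradientBoostingClassifier(n_estimators=100, max_depth=3, random_state=0)

rf_score = cross_val_score(rf, X, y, cv=5).mean()
gb_score = cross_val_score(gb, X, y, cv=5).mean()
print("forest:", round(rf_score, 3), "boosting:", round(gb_score, 3))
```

For the dedicated boosting libraries named above (XGBoost, LightGBM, CatBoost), the workflow is similar but the APIs and tuning knobs differ.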

Simple Example in Python (scikit‑learn)

from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load data
X, y = load_iris(return_X_y=True)

# Train / test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Create and fit a tree
clf = DecisionTreeClassifier(
    max_depth=3,        # control overfitting
    random_state=42
)
clf.fit(X_train, y_train)

# Evaluate
y_pred = clf.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))

This code trains a small decision tree on the classic Iris dataset and reports the classification accuracy.