# Tree Models
## Intuition for Tree Models
Tree-based models split the feature space into regions using a sequence of yes/no questions.
At each internal node, a decision rule such as “\(x_j \leq t\)?” sends samples left or right, until we reach a leaf node that holds a prediction.
- Classification trees: leaves store a class label or class probabilities.
- Regression trees: leaves store a numeric value, usually the average target in that region.
This makes trees easy to interpret: you can read the path from root to leaf as a human-readable decision rule.
## How Decision Trees Are Built

Given training data \((X, y)\), a tree is built top‑down:

- Choose a split on some feature \(x_j\) at threshold \(t\).
- Evaluate impurity (how “mixed” the labels are) before and after the split.
- Pick the split that maximizes impurity reduction (or loss reduction).
- Recurse on the left and right subsets until a stopping rule is met (max depth, min samples per leaf, etc.).
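The top‑down procedure above can be sketched in plain Python on a single feature. This is a minimal illustration, not a real library API: the names `gini`, `best_split`, and `build_tree`, and the dict-based tree layout, are all chosen here for clarity.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(xs, ys):
    """Try every midpoint threshold; return (weighted impurity, threshold)."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    xs = [xs[i] for i in order]
    ys = [ys[i] for i in order]
    best = None
    for i in range(1, len(xs)):
        if xs[i] == xs[i - 1]:
            continue  # no threshold separates identical values
        t = (xs[i] + xs[i - 1]) / 2
        left, right = ys[:i], ys[i:]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(ys)
        if best is None or score < best[0]:
            best = (score, t)
    return best  # None if no split is possible

def build_tree(xs, ys, depth=0, max_depth=2):
    split = best_split(xs, ys)
    # Stop if too deep, unsplittable, or the split does not reduce impurity:
    if depth >= max_depth or split is None or split[0] >= gini(ys):
        return Counter(ys).most_common(1)[0][0]  # leaf: majority class
    _, t = split
    left = [(x, y) for x, y in zip(xs, ys) if x <= t]
    right = [(x, y) for x, y in zip(xs, ys) if x > t]
    return {
        "threshold": t,
        "left": build_tree([x for x, _ in left], [y for _, y in left], depth + 1, max_depth),
        "right": build_tree([x for x, _ in right], [y for _, y in right], depth + 1, max_depth),
    }

tree = build_tree([1.0, 2.0, 3.0, 10.0, 11.0, 12.0], [0, 0, 0, 1, 1, 1])
print(tree)  # {'threshold': 6.5, 'left': 0, 'right': 1}
```

On this toy data the two classes separate cleanly, so one split at the midpoint 6.5 yields two pure leaves and recursion stops immediately.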
## Common Impurity Measures (Classification)

- Gini impurity for a node:

\[
G = \sum_{k} p_k (1 - p_k)
\]

where \(p_k\) is the fraction of class \(k\) in the node.

- Entropy for a node:

\[
H = -\sum_{k} p_k \log_2(p_k)
\]
The best split is the one that reduces impurity the most.
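These two measures, and the notion of “reducing impurity the most,” can be made concrete with a few lines of Python. The function names here are illustrative, not a library API; impurity reduction is computed as parent impurity minus the size-weighted impurity of the children.

```python
from math import log2

def gini(p):
    """Gini impurity from class proportions p (proportions sum to 1)."""
    return sum(pk * (1 - pk) for pk in p)

def entropy(p):
    """Entropy in bits from class proportions p."""
    return -sum(pk * log2(pk) for pk in p if pk > 0)

# A perfectly mixed binary node is maximally impure:
print(gini([0.5, 0.5]))     # 0.5
print(entropy([0.5, 0.5]))  # 1.0

# Impurity reduction for a candidate split of an 8-sample node (4 vs 4)
# into two 4-sample children (3 vs 1, and 1 vs 3):
parent = gini([0.5, 0.5])
left = gini([0.75, 0.25])
right = gini([0.25, 0.75])
reduction = parent - (4 / 8) * left - (4 / 8) * right
print(round(reduction, 4))  # 0.125
```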
## Loss for Regression Trees

For regression, impurity is usually the Mean Squared Error (MSE) within each node:

\[
\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \bar{y})^2
\]

where \(\bar{y}\) is the mean target value in that node.
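The same split-scoring idea applies here, with MSE in place of Gini: a good regression split is one whose children have much lower size-weighted MSE than the parent. A minimal sketch (the `mse` helper is illustrative, not a library function):

```python
def mse(ys):
    """Mean squared error of targets around the node mean."""
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys) / len(ys)

parent = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
left, right = parent[:3], parent[3:]  # split between 3.0 and 10.0

weighted = (len(left) * mse(left) + len(right) * mse(right)) / len(parent)
print(mse(parent))  # ~20.92
print(weighted)     # ~0.667
```

Splitting the two clusters apart drops the node MSE from about 20.92 to about 0.67, so this split would score very well.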
## Overfitting and Tree Complexity
Unconstrained trees tend to overfit: they keep splitting until each leaf contains only a few (or even one) training samples.
Typical controls:
- Max depth: limit how deep the tree can grow.
- Min samples per split/leaf: require a minimum number of samples to create a new split or leaf.
- Max features: when searching for the best split, consider only a subset of features.
Another approach is post‑pruning:
- Grow a large tree.
- Iteratively remove branches that give only a tiny improvement on validation error.
- Choose the smallest tree with good performance (bias–variance trade‑off).
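In scikit‑learn, post‑pruning is available as minimal cost‑complexity pruning via the `ccp_alpha` parameter: larger values of alpha remove more branches. A minimal sketch, assuming the Iris data as elsewhere in this section (in practice you would choose `ccp_alpha` on a validation set, e.g. via `cost_complexity_pruning_path`):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# An unconstrained tree vs. the same tree pruned with ccp_alpha
# (the value 0.02 is illustrative, not a recommended setting):
full = DecisionTreeClassifier(random_state=0).fit(X, y)
pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=0.02).fit(X, y)

print("full tree nodes:  ", full.tree_.node_count)
print("pruned tree nodes:", pruned.tree_.node_count)  # fewer nodes
```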
## Pros and Cons of Decision Trees

### Advantages
- Easy to interpret and visualize.
- Handle non‑linear relationships and feature interactions naturally.
- Work with mixed data types (numerical + categorical with proper encoding).
- Require little feature scaling.
### Disadvantages
- Individual trees are high variance (sensitive to small changes in data).
- Splits are axis‑aligned (one feature at a time), which can be suboptimal for problems with oblique decision boundaries.
- Often outperformed by ensembles of trees.
## Tree Ensembles (Brief Overview)

To improve performance and stability, we often combine many trees:

- Random Forests:
  - Train many trees on bootstrapped samples of the data.
  - At each split, consider only a random subset of features.
  - Predictions are averaged (regression) or voted (classification).
  - Great baseline with good performance and robustness.
- Gradient Boosted Trees (e.g., XGBoost, LightGBM, CatBoost):
  - Build trees sequentially; each new tree fits the residuals/errors of the previous ones.
  - Very strong performance on structured/tabular data, at the cost of more tuning.
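The variance-reduction benefit is easy to check empirically with scikit‑learn. A hedged comparison of a single tree against the two ensemble styles described above, using default-ish settings (not tuned) and Iris for consistency with the example below:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Cross-validated accuracy for each model family:
for name, model in [
    ("single tree", DecisionTreeClassifier(random_state=0)),
    ("random forest", RandomForestClassifier(n_estimators=100, random_state=0)),
    ("gradient boosting", GradientBoostingClassifier(random_state=0)),
]:
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: {scores.mean():.3f}")
```

Iris is small and easy, so all three will score well here; the gap in favor of ensembles is usually larger on noisier, higher-dimensional tabular data.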
## Simple Example in Python (scikit‑learn)

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load data
X, y = load_iris(return_X_y=True)

# Train / test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Create and fit a tree
clf = DecisionTreeClassifier(
    max_depth=3,  # control overfitting
    random_state=42,
)
clf.fit(X_train, y_train)

# Evaluate
y_pred = clf.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
```
This code trains a small decision tree on the classic Iris dataset and reports the classification accuracy.
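To make the interpretability point concrete, scikit‑learn’s `export_text` prints a fitted tree as readable if/else rules. A self-contained sketch (a fresh depth‑2 tree is fitted here so the snippet runs on its own; `max_depth=2` is illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
clf = DecisionTreeClassifier(max_depth=2, random_state=42)
clf.fit(iris.data, iris.target)

# Print the root-to-leaf rules with real feature names:
rules = export_text(clf, feature_names=list(iris.feature_names))
print(rules)
```

Each indented `|---` line is one decision rule; following a path from the top to a `class:` line reproduces exactly the kind of human-readable rule described in the intuition section.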