# Clustering Algorithms

## Intuition for Clustering
Clustering is an unsupervised learning task where we group similar data points together without labels.
The goal is to discover hidden structure in the data, such as customer segments or similar documents.
High‑level idea:
- Define a notion of similarity (often distance in feature space).
- Group points so that those in the same cluster are more similar to each other than to points in other clusters.
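For numeric features, "similarity" is usually operationalized as a distance. A minimal sketch of Euclidean distance (the function name here is illustrative):

```python
import numpy as np

def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return float(np.linalg.norm(np.asarray(a) - np.asarray(b)))

# Points closer in feature space are considered more similar
print(euclidean([0, 0], [3, 4]))  # 5.0
print(euclidean([0, 0], [1, 1]))  # ~1.414
```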
## K‑Means Clustering
K‑Means is one of the most popular clustering algorithms.
Given a desired number of clusters $K$:

1. Initialize $K$ cluster centers (centroids) $\mu_1, \dots, \mu_K$ (often randomly).
2. Assignment step: assign each point $x_i$ to the nearest centroid (by Euclidean distance).
3. Update step: recompute each centroid as the mean of all points assigned to that cluster.
4. Repeat steps 2–3 until assignments stop changing (or a maximum number of iterations is reached).
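The steps above can be sketched from scratch in NumPy (a toy implementation for illustration; it assumes no cluster ever goes empty):

```python
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # 1. Initialization: pick K distinct data points as starting centroids
    centroids = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(n_iters):
        # 2. Assignment: nearest centroid by Euclidean distance
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. Update: each centroid becomes the mean of its assigned points
        new_centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])
        # 4. Stop when centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Two well-separated groups of points
X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.1, 4.9]])
labels, centroids = kmeans(X, K=2)
```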
The objective function minimized is:

$$
J = \sum_{k=1}^{K} \sum_{x_i \in C_k} \| x_i - \mu_k \|^2
$$

where $C_k$ is the set of points in cluster $k$.
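In code, the objective is just a sum of squared distances from each point to its assigned centroid (variable names here are illustrative):

```python
import numpy as np

def objective(X, labels, centroids):
    """J: sum of squared distances from each point to its centroid."""
    return sum(np.sum((X[labels == k] - mu) ** 2)
               for k, mu in enumerate(centroids))

X = np.array([[0.0, 0.0], [2.0, 0.0], [10.0, 0.0], [12.0, 0.0]])
labels = np.array([0, 0, 1, 1])
centroids = np.array([[1.0, 0.0], [11.0, 0.0]])
print(objective(X, labels, centroids))  # 4.0: each point is 1 away from its centroid
```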
## Choosing K: the Elbow Method (WCSS)

The Within‑Cluster Sum of Squares (WCSS) is essentially the objective $J$ above.
As $K$ increases, WCSS decreases, but after a point the improvement becomes marginal.
Steps for the elbow method:

- Run K‑Means for different values of $K$ (e.g., 1 to 10).
- Record WCSS (or inertia in scikit‑learn) for each $K$.
- Plot WCSS vs. $K$ and look for an "elbow" where the rate of decrease sharply slows.
- Choose $K$ around that elbow point.
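The procedure maps directly onto scikit‑learn, whose fitted `KMeans` exposes the WCSS as `inertia_`:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=42)

ks = range(1, 11)
wcss = []
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    wcss.append(km.inertia_)  # WCSS for this value of K

# The curve drops sharply until K=4 (the true number of blobs), then flattens
plt.plot(ks, wcss, marker="o")
plt.xlabel("K")
plt.ylabel("WCSS (inertia)")
plt.title("Elbow Method")
plt.show()
```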
## Other Clustering Algorithms (Brief Overview)
- Hierarchical Clustering
  - Builds a tree (dendrogram) of clusters.
  - Can be agglomerative (bottom‑up) or divisive (top‑down).
  - Does not require pre‑specifying $K$; you cut the dendrogram at a chosen level.
- DBSCAN (Density‑Based Spatial Clustering of Applications with Noise)
  - Groups points based on density (regions of high point concentration).
  - Can find arbitrarily shaped clusters.
  - Identifies noise/outliers as points not belonging to any dense region.
- Gaussian Mixture Models (GMMs)
  - Assume data are generated from a mixture of Gaussian distributions.
  - Provide soft assignments (probabilities of belonging to each cluster).
  - Fitted using the EM algorithm.
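All three alternatives have ready‑made implementations. A compact sketch comparing them on the same toy data (parameter values here are illustrative, not tuned recommendations):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.cluster import DBSCAN
from sklearn.mixture import GaussianMixture

# Two dense blobs plus one far-away outlier
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, size=(20, 2)),
               rng.normal(5, 0.3, size=(20, 2)),
               [[20.0, 20.0]]])

# Hierarchical: build the dendrogram, then cut it into 3 clusters
Z = linkage(X, method="ward")
hier_labels = fcluster(Z, t=3, criterion="maxclust")

# DBSCAN: density-based; no K needed, isolated points get the noise label -1
db_labels = DBSCAN(eps=1.0, min_samples=5).fit_predict(X)
print(db_labels[-1])  # -1: DBSCAN flags the outlier as noise

# GMM (fit on the two blobs only): soft assignments via predict_proba
gmm = GaussianMixture(n_components=2, random_state=0).fit(X[:40])
probs = gmm.predict_proba(X[:40])  # each row is a probability vector summing to 1
```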
## Pros and Cons of K‑Means

### Advantages
- Simple and fast; scales well to large datasets.
- Easy to implement and understand.
- Works well when clusters are roughly spherical and similar in size.
### Disadvantages

- Requires choosing $K$ in advance.
- Sensitive to initialization and outliers.
- Struggles with non‑spherical clusters or clusters of very different densities.
## Simple Example in Python (scikit‑learn)
```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Generate synthetic 2D data
X, y_true = make_blobs(
    n_samples=300,
    centers=4,
    cluster_std=0.60,
    random_state=42,
)

# Fit K-Means with K=4 (n_init set explicitly; its default varies by sklearn version)
kmeans = KMeans(n_clusters=4, n_init=10, random_state=42)
clusters = kmeans.fit_predict(X)
centers = kmeans.cluster_centers_

# Plot clusters and centroids
plt.scatter(X[:, 0], X[:, 1], c=clusters, s=30, cmap="viridis")
plt.scatter(
    centers[:, 0], centers[:, 1],
    c="red", s=200, alpha=0.7, marker="X",
)
plt.title("K-Means Clustering (K=4)")
plt.show()
```
This code generates synthetic data, applies K‑Means with $K = 4$, and visualizes the resulting clusters and centroids.