Anomaly detection (also called outlier detection) is the task of identifying data points, events, or observations that deviate significantly from the expected pattern or behavior.
Common applications:
- Fraud detection in financial transactions
- Network intrusion detection
- Quality control in manufacturing
- Sensor-based condition monitoring and predictive maintenance
Anomalies are commonly grouped into point anomalies (a single unusual value), contextual anomalies (values unusual only in a given context, such as time of day), and collective anomalies (a group of values that is jointly unusual). Understanding the type of anomaly helps in choosing the right method.
The simplest approach: flag any point that is more than \(k\) standard deviations away from the mean.
\[z_i = \frac{x_i - \mu}{\sigma}\]
Points where \(|z_i| > 3\) (the 3-sigma rule) are typically flagged as anomalies.
Limitation: assumes data is normally distributed and stationary.
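A minimal NumPy sketch, assuming `x` is a 1-D array of observations:

```python
import numpy as np

# Standardize, then apply the 3-sigma rule
z = (x - x.mean()) / x.std()
anomalies = x[np.abs(z) > 3]
```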
For time series, use a rolling mean and rolling standard deviation to adapt to local behavior:
\[\text{anomaly if } |x_t - \mu_{\text{roll}}| > k \cdot \sigma_{\text{roll}}\]
This is more robust to trends and slow drifts.
The interquartile range (IQR) is a robust alternative to z-scores:
\[\text{IQR} = Q_3 - Q_1\]
Flag points below \(Q_1 - 1.5 \cdot \text{IQR}\) or above \(Q_3 + 1.5 \cdot \text{IQR}\).
Because it uses quartiles rather than the mean and standard deviation, the IQR rule is less sensitive to extreme outliers skewing the statistics.
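A quick sketch (again assuming a 1-D NumPy array `x`):

```python
import numpy as np

q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1

# Tukey's fences: 1.5 * IQR beyond each quartile
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
anomalies = x[(x < lower) | (x > upper)]
```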
Isolation Forest works by randomly partitioning the feature space using decision trees.
Key intuition: anomalies are easier to isolate (they require fewer splits) than normal points.
Works well for high-dimensional data and is computationally efficient.
A One-Class SVM trains a boundary around the normal data in feature space.
Points outside this boundary are flagged as anomalies.
Useful when you only have normal examples to train on (semi-supervised setting).
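A minimal scikit-learn sketch, assuming `X_train` holds only normal samples and `X_new` is the data to score (both shaped `(n_samples, n_features)`):

```python
import numpy as np
from sklearn.svm import OneClassSVM

# nu roughly upper-bounds the fraction of training points treated as outliers
oc_svm = OneClassSVM(kernel="rbf", nu=0.05, gamma="scale")
oc_svm.fit(X_train)

labels = oc_svm.predict(X_new)  # +1 = normal, -1 = anomaly
anomaly_indices = np.where(labels == -1)[0]
print(f"Detected {len(anomaly_indices)} anomalies.")
```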
LOF (Local Outlier Factor) measures the local density of a point relative to that of its neighbors.
\[\text{LOF}(x) = \frac{\text{avg. local density of neighbors}}{\text{local density of } x}\]
Good for detecting contextual and local anomalies that global methods miss.
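A minimal scikit-learn sketch, assuming `X` is a feature array of shape `(n_samples, n_features)`; note that in its default (non-novelty) mode, LOF scores only the data it was fit on:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

lof = LocalOutlierFactor(n_neighbors=20, contamination=0.05)
labels = lof.fit_predict(X)  # +1 = normal, -1 = anomaly

# Sign-flipped so that higher values mean more anomalous
scores = -lof.negative_outlier_factor_

anomaly_indices = np.where(labels == -1)[0]
print(f"Detected {len(anomaly_indices)} anomalies.")
```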
An autoencoder is trained to reconstruct its input: an encoder compresses the data into a low-dimensional bottleneck, and a decoder reconstructs it. Train only on normal data. At inference:
\[\text{reconstruction error} = \| x - \hat{x} \|^2\]
High reconstruction error → the model struggles to reconstruct the point → likely anomalous.
This works well for time series, images, and tabular data.
For sequential data, an LSTM (Long Short-Term Memory) network can be trained to predict the next value in a series.
\[r_t = y_t - \hat{y}_t\]
Large residuals \(r_t\) indicate anomalous time steps. This was briefly introduced in the Time Series chapter and is expanded here.
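A compact Keras sketch of this idea, assuming `series` is a 1-D NumPy array of mostly normal values and using a hypothetical lookback length `window`:

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

window = 24  # hypothetical lookback length

# Sliding windows as inputs, the next value as the target
X = np.array([series[i:i + window] for i in range(len(series) - window)])
y = series[window:]
X = X[..., np.newaxis]  # shape: (n_samples, window, 1)

model = keras.Sequential([
    keras.Input(shape=(window, 1)),
    layers.LSTM(32),
    layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
model.fit(X, y, epochs=20, batch_size=32, verbose=0)

# Residuals r_t = y_t - y_hat_t
residuals = y - model.predict(X, verbose=0).ravel()

# Flag time steps whose residuals violate the 3-sigma rule
mask = np.abs(residuals - residuals.mean()) > 3 * residuals.std()
anomalous_steps = np.where(mask)[0] + window  # indices into 'series'
```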
Anomaly detection is tricky to evaluate because anomalies are rare (class imbalance).
Avoid plain accuracy — a model that labels everything as normal can still achieve 99% accuracy if anomalies are 1% of data.
Preferred metrics:
- Precision, recall, and F1 computed on the anomaly class
- Precision-recall AUC (average precision), which focuses on the rare class
- ROC AUC, interpreted with caution under heavy class imbalance
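A short sketch of these metrics with scikit-learn, assuming `y_true` holds ground-truth labels (1 = anomaly, 0 = normal), `y_pred` holds thresholded predictions, and `y_score` holds continuous anomaly scores:

```python
from sklearn.metrics import precision_recall_fscore_support, average_precision_score

# Precision, recall, and F1 on the anomaly class only
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="binary", pos_label=1
)

# Average precision = area under the precision-recall curve (threshold-free)
pr_auc = average_precision_score(y_true, y_score)

print(f"precision={precision:.3f}, recall={recall:.3f}, "
      f"f1={f1:.3f}, PR-AUC={pr_auc:.3f}")
```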
The following examples implement some of the methods above, starting with a rolling z-score detector:

```python
import pandas as pd
import numpy as np

# Suppose 'ts' is a pandas Series with a DateTimeIndex
def rolling_zscore_anomalies(ts, window=30, threshold=3.0):
    rolling_mean = ts.rolling(window=window).mean()
    rolling_std = ts.rolling(window=window).std()
    z_scores = (ts - rolling_mean) / rolling_std
    anomalies = ts[np.abs(z_scores) > threshold]
    return anomalies

anomalies = rolling_zscore_anomalies(ts)
print("Anomalies detected:")
print(anomalies)
```
Isolation Forest with scikit-learn:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# X: feature array of shape (n_samples, n_features)
# For a univariate time series, reshape: X = ts.values.reshape(-1, 1)
clf = IsolationForest(
    n_estimators=100,
    contamination=0.05,  # expected fraction of anomalies
    random_state=42,
)
clf.fit(X)

# predict() returns +1 (normal) or -1 (anomaly)
labels = clf.predict(X)
anomaly_indices = np.where(labels == -1)[0]
print(f"Detected {len(anomaly_indices)} anomalies.")
```
An autoencoder-based detector with Keras:

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Assume X_train contains only normal samples, shape: (n_samples, n_features)
input_dim = X_train.shape[1]
encoding_dim = 8  # bottleneck size

# Build autoencoder
inputs = keras.Input(shape=(input_dim,))
encoded = layers.Dense(32, activation="relu")(inputs)
encoded = layers.Dense(encoding_dim, activation="relu")(encoded)
decoded = layers.Dense(32, activation="relu")(encoded)
decoded = layers.Dense(input_dim, activation="linear")(decoded)

autoencoder = keras.Model(inputs, decoded)
autoencoder.compile(optimizer="adam", loss="mse")

autoencoder.fit(
    X_train, X_train,
    epochs=50,
    batch_size=32,
    validation_split=0.1,
    verbose=1,
)

# Reconstruction error on new data
X_pred = autoencoder.predict(X_test)
reconstruction_errors = np.mean(np.square(X_test - X_pred), axis=1)

# Set threshold (e.g., 95th percentile of training errors)
X_train_pred = autoencoder.predict(X_train)
train_errors = np.mean(np.square(X_train - X_train_pred), axis=1)
threshold = np.percentile(train_errors, 95)

anomalies = np.where(reconstruction_errors > threshold)[0]
print(f"Detected {len(anomalies)} anomalies using autoencoder.")
```
| Method | Best For | Notes |
|---|---|---|
| Z-Score / IQR | Simple univariate, stationary data | Fast, interpretable |
| Rolling statistics | Time series with slow drift | Adaptive to local behavior |
| Isolation Forest | High-dimensional tabular data | Robust, scalable |
| LOF | Local/contextual anomalies | Sensitive to neighborhood size |
| One-Class SVM | Small datasets, semi-supervised | Slow on large data |
| Autoencoder | Complex patterns, images, sequences | Needs sufficient normal data |
| LSTM residuals | Sequential/time series data | Best for temporal dependencies |
Anomaly detection and time series forecasting are closely linked.
Many production systems combine both:
1. Fit a forecasting model (e.g., ARIMA or an LSTM) on historical data.
2. Compute residuals between observed values and forecasts.
3. Flag time steps where the residuals are unusually large.
This forecast-then-detect pipeline is often the most interpretable approach in industrial settings such as predictive maintenance, and it maps directly onto sensor-based condition monitoring work.
To recap: an anomaly (outlier) is an observation that does not conform to the expected pattern. Common approaches covered in this chapter:
- Statistical thresholds (z-score, rolling statistics, IQR)
- Isolation and density methods (Isolation Forest, One-Class SVM, LOF)
- Model-based methods (autoencoder reconstruction error, LSTM/ARIMA forecast residuals)
In the forecasting-based approach, flag points where \(|r_t|\) is much larger than usual (e.g., beyond 3 standard deviations).
First, fit an ARIMA model and forecast the held-out window:

```python
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# Example: univariate time series in a pandas Series
# index: DateTimeIndex, values: some measurement (e.g., daily sales)
ts = pd.read_csv("my_timeseries.csv", parse_dates=["date"], index_col="date")["value"]

# Train-test split (last 30 days as test)
train = ts.iloc[:-30]
test = ts.iloc[-30:]

# Fit ARIMA(p, d, q). This is just an example order.
model = ARIMA(train, order=(2, 1, 2))
model_fit = model.fit()

forecast = model_fit.forecast(steps=30)
print("Forecasted values:")
print(forecast)
```
Then flag anomalies in the residuals:

```python
import numpy as np

# Suppose 'residuals' is a pandas Series of y_t - y_hat_t,
# e.g. residuals = test - forecast from the ARIMA example above
mean = residuals.mean()
std = residuals.std()
threshold = 3 * std  # 3-sigma rule

anomalies = residuals[np.abs(residuals - mean) > threshold]
print("Anomalies detected at:")
print(anomalies.index)
```
These examples show how to build a basic ARIMA forecast and then use residuals to detect unusual points.