In machine learning, building a model is only half of the engineering task.
The more important question is:
How well does the model generalize to unseen data?
A machine learning model should not simply memorize the training dataset.
Instead, it should learn the underlying patterns and relationships present in the data.
Model evaluation metrics help engineers understand how well a model is likely to perform on data it has never seen, and where it is likely to fail.
The interpretation of metrics is often more important than the metric itself.
A model with extremely high training accuracy may still fail in production.
For example:
| Scenario | Interpretation |
|---|---|
| Low training error + High testing error | Overfitting |
| High training error + High testing error | Underfitting |
| Similar train and test errors | Good generalization |
| Excellent metric but poor physical interpretation | Dataset leakage or invalid modeling |
In engineering systems, models must not only produce good scores but must also remain reliable, interpretable, and physically plausible.
Regression models predict continuous numerical values.
Examples include predicting a temperature in °C or a material strength in MPa.
Common regression metrics include MAE, MSE, RMSE, and R².
MAE measures the average absolute difference between prediction and actual values.
MAE = (1/n) * Σ |y_true - y_pred|
If:
MAE = 2 °C
then the model predictions are off by approximately ±2°C on average.
MAE is preferred when all errors should carry equal weight and occasional outliers should not dominate the score.
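As a quick illustration, here is a minimal sketch of computing MAE with scikit-learn; the temperature values are made up for the example:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Hypothetical measured and predicted temperatures in °C
y_true = np.array([20.0, 22.5, 19.0, 25.0])
y_pred = np.array([21.5, 21.0, 20.5, 27.0])

mae = mean_absolute_error(y_true, y_pred)
print(f"MAE = {mae:.2f} °C")  # average absolute deviation from the measurements
```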
MSE squares the prediction errors before averaging them.
MSE = (1/n) * Σ (y_true - y_pred)^2
Suppose:
| MAE | MSE |
|---|---|
| Small | Very Large |
This situation usually indicates that a few predictions carry very large errors: squaring amplifies those outliers, so MSE grows rapidly even while the average absolute error stays small.
This is extremely important in engineering systems.
For example, a strength-prediction model may be accurate for most components yet occasionally produce one dangerously wrong estimate.
In such situations, MSE (or RMSE) exposes the problem while MAE alone hides it.
This is why engineers should never rely on only one metric.
RMSE is simply the square root of MSE.
RMSE = √MSE
RMSE is widely used because it balances interpretability (it is expressed in the same units as the target) with sensitivity to large errors.
For example:
RMSE = 5 MPa
means the strength predictions typically deviate from the measured values by about 5 MPa.
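A small sketch (NumPy only, with hypothetical strength values) showing MSE and RMSE computed directly from their definitions above:

```python
import numpy as np

# Hypothetical measured and predicted material strengths (MPa)
y_true = np.array([250.0, 300.0, 275.0, 320.0])
y_pred = np.array([245.0, 310.0, 270.0, 315.0])

mse = np.mean((y_true - y_pred) ** 2)  # squaring penalizes large errors more heavily
rmse = np.sqrt(mse)                    # back in the original units (MPa)

print(f"MSE  = {mse:.1f} MPa^2")
print(f"RMSE = {rmse:.1f} MPa")
```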
The relationship between MAE and RMSE reveals important information about prediction stability.
| Condition | Interpretation |
|---|---|
| RMSE ≈ MAE | Errors are of similar magnitude, no dominant outliers |
| RMSE >> MAE | A few large errors (outliers) are present |
| Very low MAE but high RMSE | Model unstable in some regions |
| Both high | Overall poor model |
If RMSE is much larger than MAE:
RMSE >> MAE
then a small number of predictions carry very large errors.
This often happens in datasets containing outliers, noisy measurements, or operating conditions that are poorly represented in the training data.
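The sketch below (hypothetical values) shows how a single badly wrong prediction pushes RMSE far above MAE:

```python
import numpy as np

# Hypothetical strength predictions (MPa); the last one is an outlier
y_true = np.array([100.0, 105.0, 98.0, 110.0, 102.0])
y_pred = np.array([101.0, 104.0, 99.0, 109.0, 152.0])

errors = y_true - y_pred
mae = np.mean(np.abs(errors))         # ≈ 10.8 MPa
rmse = np.sqrt(np.mean(errors ** 2))  # ≈ 22.4 MPa

print(f"MAE  = {mae:.1f} MPa")
print(f"RMSE = {rmse:.1f} MPa")  # RMSE >> MAE flags the outlier
```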
R² measures how much variance in the target variable is explained by the model.
R² = 1 - (SS_res / SS_tot)
| R² Value | Meaning |
|---|---|
| 1.0 | Perfect prediction |
| 0.9 | Excellent fit |
| 0.7 | Good fit |
| 0.5 | Moderate fit |
| 0 | Model no better than mean prediction |
| Negative | Model performs worse than predicting the mean |
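A short sketch computing R² directly from the SS_res / SS_tot definition (hypothetical values); scikit-learn's r2_score returns the same result:

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical measured and predicted values
y_true = np.array([10.0, 12.0, 15.0, 11.0, 14.0])
y_pred = np.array([10.5, 11.5, 14.0, 11.5, 13.5])

ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
r2_manual = 1.0 - ss_res / ss_tot

print(f"R² (manual)  = {r2_manual:.3f}")
print(f"R² (sklearn) = {r2_score(y_true, y_pred):.3f}")  # matches the manual value
```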
A high R² does not always mean a good model.
For example, a model can show an excellent R² on the training data while failing badly on the test data, or inflate R² through data leakage.
Always compare R² on the training and testing sets rather than judging from a single value.
This is one of the most critical aspects of model evaluation.
| Dataset | RMSE |
|---|---|
| Training | 3.1 |
| Testing | 3.5 |
The small difference indicates good generalization: the model performs almost as well on unseen data as on the training data.
| Dataset | RMSE |
|---|---|
| Training | 0.5 |
| Testing | 15.2 |
Interpretation: severe overfitting; the model has essentially memorized the training data and fails on unseen data.
| Dataset | RMSE |
|---|---|
| Training | 12 |
| Testing | 13 |
Interpretation: underfitting; train and test errors are similar, but both are high, so the model is too simple to capture the underlying patterns.
Residuals are the differences between the actual and predicted values:
Residual = Actual - Predicted
Residual analysis helps identify systematic bias, missed nonlinear behavior, and regions where the model is unreliable.
Residuals should appear randomly scattered around zero with no visible structure.
Patterns in the residuals indicate that the model has failed to capture some relationship in the data.
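The following sketch builds a deliberately too-simple model on synthetic data so the residual plot shows the kind of pattern to look for (matplotlib is assumed to be available):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

# Synthetic data with mild curvature, fitted with a straight line,
# so the residuals show a visible pattern (missed nonlinearity).
x = np.linspace(0.0, 10.0, 100)
y_true = 2.0 * x + 0.3 * x**2 + rng.normal(0.0, 1.0, x.size)
y_pred = np.polyval(np.polyfit(x, y_true, deg=1), x)

residuals = y_true - y_pred  # Residual = Actual - Predicted

plt.scatter(y_pred, residuals, s=15)
plt.axhline(0.0, color="red", linewidth=1)
plt.xlabel("Predicted value")
plt.ylabel("Residual (actual - predicted)")
plt.title("Residuals vs. predictions")
plt.show()
```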
Classification models predict categories instead of continuous values.
Examples include detecting equipment failures (failure vs. normal operation) and predicting turbine faults.
Common metrics include accuracy, precision, recall, F1 score, and ROC-AUC.
Accuracy measures overall correct predictions.
Accuracy = Correct Predictions / Total Predictions
High accuracy alone can be misleading.
Suppose 99% of the samples in a monitoring dataset are Normal and only 1% are Failures.
A model that always predicts Normal achieves 99% accuracy but completely fails at detecting failures.
This is why accuracy alone is dangerous in imbalanced engineering datasets.
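A tiny sketch of that imbalance problem with synthetic labels (1 = failure, 0 = normal):

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Synthetic labels: 990 normal samples (0) and 10 failures (1)
y_true = np.array([0] * 990 + [1] * 10)

# A "model" that always predicts Normal
y_pred = np.zeros_like(y_true)

print(f"Accuracy          = {accuracy_score(y_true, y_pred):.2%}")  # 99.00%
print(f"Recall (failures) = {recall_score(y_true, y_pred):.2%}")    # 0.00%
```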
A confusion matrix provides deeper insight.
| | Predicted Positive | Predicted Negative |
|---|---|---|
| Actual Positive | True Positive (TP) | False Negative (FN) |
| Actual Negative | False Positive (FP) | True Negative (TN) |
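A brief confusion-matrix sketch with scikit-learn (the labels are hypothetical); passing labels=[1, 0] orders the matrix the same way as the table above:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 1 = failure (positive class), 0 = normal
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Rows are actual classes, columns are predicted classes,
# ordered positive first to match the table above.
cm = confusion_matrix(y_true, y_pred, labels=[1, 0])
tp, fn = cm[0]
fp, tn = cm[1]
print(f"TP={tp} FN={fn} FP={fp} TN={tn}")
```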
Precision measures prediction correctness among predicted positives.
Precision = TP / (TP + FP)
High precision means that when the model predicts a failure, it is usually correct, so false alarms are rare.
This is important in applications where false alarms are costly, such as unnecessary shutdowns or inspections.
Recall measures how many actual positives are detected.
Recall = TP / (TP + FN)
High recall means that few actual failures are missed.
This is critical in safety-related applications where a missed failure is unacceptable.
| Situation | Priority |
|---|---|
| Avoiding false alarms | High Precision |
| Avoiding missed failures | High Recall |
In turbine fault prediction, a missed fault can lead to serious damage, while a false alarm usually costs only an unnecessary inspection.
Therefore, recall becomes more important than precision.
F1 Score balances precision and recall.
F1 = 2 × (Precision × Recall) / (Precision + Recall)
It is useful when the classes are imbalanced and both false alarms and missed detections matter.
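A short sketch computing precision, recall, and F1 with scikit-learn on the same kind of hypothetical labels:

```python
from sklearn.metrics import precision_score, recall_score, f1_score

# Hypothetical labels: 1 = failure, 0 = normal
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

precision = precision_score(y_true, y_pred)  # TP / (TP + FP)
recall = recall_score(y_true, y_pred)        # TP / (TP + FN)
f1 = f1_score(y_true, y_pred)                # harmonic mean of precision and recall

print(f"Precision = {precision:.2f}, Recall = {recall:.2f}, F1 = {f1:.2f}")
```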
The ROC curve evaluates classifier performance across all decision thresholds; the area under the curve (AUC) summarizes it in a single number.
| AUC | Meaning |
|---|---|
| 1.0 | Perfect classifier |
| 0.9 | Excellent |
| 0.8 | Good |
| 0.7 | Acceptable |
| 0.5 | Random guessing |
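A minimal AUC sketch with scikit-learn; the scores are hypothetical predicted failure probabilities:

```python
from sklearn.metrics import roc_auc_score

# Hypothetical true labels (1 = failure) and predicted failure probabilities
y_true = [0, 0, 1, 1, 0, 1, 0, 1]
y_scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.3, 0.9]

auc = roc_auc_score(y_true, y_scores)
print(f"AUC = {auc:.2f}")  # 1.0 = perfect ranking, 0.5 = random guessing
```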
Understanding the bias-variance trade-off is also fundamental: underfitting reflects high bias, while overfitting reflects high variance.
Cross-validation improves evaluation reliability.
Instead of relying on a single train-test split, the data is divided into k folds; the model is trained on k-1 folds and validated on the remaining fold, rotating until every fold has served as the validation set.
Benefits include a more reliable performance estimate, better use of limited data, and a view of how much performance varies across folds.
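A compact 5-fold cross-validation sketch with scikit-learn; the synthetic dataset and Ridge estimator are placeholders for a real engineering problem:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Synthetic regression data standing in for a real engineering dataset
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# Train on 4 folds, validate on the remaining fold, rotating through all 5 folds
scores = cross_val_score(Ridge(), X, y, cv=5, scoring="neg_root_mean_squared_error")
rmse_per_fold = -scores

print("RMSE per fold:", rmse_per_fold.round(2))
print("Mean RMSE    :", rmse_per_fold.mean().round(2), "±", rmse_per_fold.std().round(2))
```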
In engineering applications, the best model is not always the one with the best score.
A practical engineering model should also be interpretable, stable under small changes in the input data, and consistent with physical expectations.
A slightly lower accuracy model with stable behavior is often preferred over an unstable high-accuracy model.
| Mistake | Consequence |
|---|---|
| Using only accuracy | Misleading conclusions |
| Ignoring train-test comparison | Hidden overfitting |
| Ignoring outliers | Unsafe predictions |
| Using R² alone | False confidence |
| Ignoring residual plots | Missed nonlinear behavior |
| Evaluating on training data only | Unrealistic performance |
| Ignoring data imbalance | Biased classification |
Model evaluation is not simply about obtaining a high score.
The true objective is understanding how the model behaves, where it fails, and whether it can be trusted on unseen data.
A strong engineer does not only train models.
A strong engineer critically analyzes model behavior, limitations, uncertainties, and reliability before deployment.