In machine learning, building a model is only half of the engineering task.
The more important question is:
How well does the model generalize to unseen data?
A machine learning model should not simply memorize the training dataset.
Instead, it should learn the underlying patterns and relationships present in the data.
Model evaluation metrics help engineers understand how well a model is likely to perform on data it has never seen, and where it is likely to fail.
The interpretation of metrics is often more important than the metric itself.
A model with extremely high training accuracy may still fail in production.
For example:
| Scenario | Interpretation |
|---|---|
| Low training error + High testing error | Overfitting |
| High training error + High testing error | Underfitting |
| Similar train and test errors | Good generalization |
| Excellent metric but poor physical interpretation | Dataset leakage or invalid modeling |
In engineering systems, models must not only produce good scores but must also remain reliable, interpretable, and physically plausible.
Regression models predict continuous numerical values.
Examples include predicting a temperature in °C or a material strength in MPa.
Common regression metrics include MAE, MSE, RMSE, and R².
MAE measures the average absolute difference between prediction and actual values.
MAE = (1/n) * Σ |y_true - y_pred|
If:
MAE = 2 °C
then the model predictions are off by approximately ±2°C on average.
MAE is preferred when all errors should carry equal weight and occasional outliers should not dominate the score.
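As a quick illustration, here is a minimal sketch of computing MAE with scikit-learn; the temperature values are made up for the example:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Hypothetical measured and predicted temperatures in °C
y_true = np.array([20.0, 22.5, 19.0, 25.0])
y_pred = np.array([21.5, 21.0, 20.5, 27.0])

mae = mean_absolute_error(y_true, y_pred)
print(f"MAE = {mae:.2f} °C")  # average absolute deviation from the measurements
```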
MSE squares the prediction errors before averaging them.
MSE = (1/n) * Σ (y_true - y_pred)^2
Suppose:
| MAE | MSE |
|---|---|
| Small | Very Large |
This situation usually indicates that a few predictions carry very large errors: squaring amplifies those outliers, so MSE grows rapidly even while the average absolute error stays small.
This is extremely important in engineering systems.
For example, a strength-prediction model may be accurate for most components yet occasionally produce one dangerously wrong estimate.
In such situations, MSE (or RMSE) exposes the problem while MAE alone hides it.
This is why engineers should never rely on only one metric.
RMSE is simply the square root of MSE.
RMSE = √MSE
RMSE is widely used because it balances interpretability (it is expressed in the same units as the target) with sensitivity to large errors.
For example:
RMSE = 5 MPa
means the strength predictions typically deviate from the measured values by about 5 MPa.
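A small sketch (NumPy only, with hypothetical strength values) showing MSE and RMSE computed directly from their definitions above:

```python
import numpy as np

# Hypothetical measured and predicted material strengths (MPa)
y_true = np.array([250.0, 300.0, 275.0, 320.0])
y_pred = np.array([245.0, 310.0, 270.0, 315.0])

mse = np.mean((y_true - y_pred) ** 2)  # squaring penalizes large errors more heavily
rmse = np.sqrt(mse)                    # back in the original units (MPa)

print(f"MSE  = {mse:.1f} MPa^2")
print(f"RMSE = {rmse:.1f} MPa")
```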
The relationship between MAE and RMSE reveals important information about prediction stability.
| Condition | Interpretation |
|---|---|
| RMSE ≈ MAE | Errors are of similar magnitude, no dominant outliers |
| RMSE >> MAE | A few large errors (outliers) are present |
| Very low MAE but high RMSE | Model unstable in some regions |
| Both high | Overall poor model |
If RMSE is much larger than MAE:
RMSE >> MAE
then a small number of predictions carry very large errors.
This often happens in datasets containing outliers, noisy measurements, or operating conditions that are poorly represented in the training data.
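The sketch below (hypothetical values) shows how a single badly wrong prediction pushes RMSE far above MAE:

```python
import numpy as np

# Hypothetical strength predictions (MPa); the last one is an outlier
y_true = np.array([100.0, 105.0, 98.0, 110.0, 102.0])
y_pred = np.array([101.0, 104.0, 99.0, 109.0, 152.0])

errors = y_true - y_pred
mae = np.mean(np.abs(errors))         # ≈ 10.8 MPa
rmse = np.sqrt(np.mean(errors ** 2))  # ≈ 22.4 MPa

print(f"MAE  = {mae:.1f} MPa")
print(f"RMSE = {rmse:.1f} MPa")  # RMSE >> MAE flags the outlier
```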
R² measures how much variance in the target variable is explained by the model.
R² = 1 - (SS_res / SS_tot)
| R² Value | Meaning |
|---|---|
| 1.0 | Perfect prediction |
| 0.9 | Excellent fit |
| 0.7 | Good fit |
| 0.5 | Moderate fit |
| 0 | Model no better than mean prediction |
| Negative | Model performs worse than predicting the mean |
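A short sketch computing R² directly from the SS_res / SS_tot definition (hypothetical values); scikit-learn's r2_score returns the same result:

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical measured and predicted values
y_true = np.array([10.0, 12.0, 15.0, 11.0, 14.0])
y_pred = np.array([10.5, 11.5, 14.0, 11.5, 13.5])

ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
r2_manual = 1.0 - ss_res / ss_tot

print(f"R² (manual)  = {r2_manual:.3f}")
print(f"R² (sklearn) = {r2_score(y_true, y_pred):.3f}")  # matches the manual value
```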
A high R² does not always mean a good model.
For example, a model can show an excellent R² on the training data while failing badly on the test data, or inflate R² through data leakage.
Always compare R² on the training and testing sets rather than judging from a single value.
This is one of the most critical aspects of model evaluation.
| Dataset | RMSE |
|---|---|
| Training | 3.1 |
| Testing | 3.5 |
The small difference indicates good generalization: the model performs almost as well on unseen data as on the training data.
| Dataset | RMSE |
|---|---|
| Training | 0.5 |
| Testing | 15.2 |
Interpretation: severe overfitting; the model has essentially memorized the training data and fails on unseen data.
| Dataset | RMSE |
|---|---|
| Training | 12 |
| Testing | 13 |
Interpretation: underfitting; train and test errors are similar, but both are high, so the model is too simple to capture the underlying patterns.
Residuals are the differences between the actual and predicted values:
Residual = Actual - Predicted
Residual analysis helps identify systematic bias, missed nonlinear behavior, and regions where the model is unreliable.
Residuals should appear randomly scattered around zero with no visible structure.
Patterns in the residuals indicate that the model has failed to capture some relationship in the data.
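The following sketch builds a deliberately too-simple model on synthetic data so the residual plot shows the kind of pattern to look for (matplotlib is assumed to be available):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

# Synthetic data with mild curvature, fitted with a straight line,
# so the residuals show a visible pattern (missed nonlinearity).
x = np.linspace(0.0, 10.0, 100)
y_true = 2.0 * x + 0.3 * x**2 + rng.normal(0.0, 1.0, x.size)
y_pred = np.polyval(np.polyfit(x, y_true, deg=1), x)

residuals = y_true - y_pred  # Residual = Actual - Predicted

plt.scatter(y_pred, residuals, s=15)
plt.axhline(0.0, color="red", linewidth=1)
plt.xlabel("Predicted value")
plt.ylabel("Residual (actual - predicted)")
plt.title("Residuals vs. predictions")
plt.show()
```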
Classification models predict categories instead of continuous values.
Examples include detecting equipment failures (failure vs. normal operation) and predicting turbine faults.
Common metrics include accuracy, precision, recall, F1 score, and ROC-AUC.
Accuracy measures overall correct predictions.
Accuracy = Correct Predictions / Total Predictions
High accuracy alone can be misleading.
Suppose 99% of the samples in a monitoring dataset are Normal and only 1% are Failures.
A model that always predicts Normal achieves 99% accuracy but completely fails at detecting failures.
This is why accuracy alone is dangerous in imbalanced engineering datasets.
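A tiny sketch of that imbalance problem with synthetic labels (1 = failure, 0 = normal):

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Synthetic labels: 990 normal samples (0) and 10 failures (1)
y_true = np.array([0] * 990 + [1] * 10)

# A "model" that always predicts Normal
y_pred = np.zeros_like(y_true)

print(f"Accuracy          = {accuracy_score(y_true, y_pred):.2%}")  # 99.00%
print(f"Recall (failures) = {recall_score(y_true, y_pred):.2%}")    # 0.00%
```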
A confusion matrix provides deeper insight.
| | Predicted Positive | Predicted Negative |
|---|---|---|
| Actual Positive | True Positive (TP) | False Negative (FN) |
| Actual Negative | False Positive (FP) | True Negative (TN) |
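A brief confusion-matrix sketch with scikit-learn (the labels are hypothetical); passing labels=[1, 0] orders the matrix the same way as the table above:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 1 = failure (positive class), 0 = normal
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Rows are actual classes, columns are predicted classes,
# ordered positive first to match the table above.
cm = confusion_matrix(y_true, y_pred, labels=[1, 0])
tp, fn = cm[0]
fp, tn = cm[1]
print(f"TP={tp} FN={fn} FP={fp} TN={tn}")
```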
Precision measures prediction correctness among predicted positives.
Precision = TP / (TP + FP)
High precision means that when the model predicts a failure, it is usually correct, so false alarms are rare.
This is important in applications where false alarms are costly, such as unnecessary shutdowns or inspections.
Recall measures how many actual positives are detected.
Recall = TP / (TP + FN)
High recall means that few actual failures are missed.
This is critical in safety-related applications where a missed failure is unacceptable.
| Situation | Priority |
|---|---|
| Avoiding false alarms | High Precision |
| Avoiding missed failures | High Recall |
In turbine fault prediction, a missed fault can lead to serious damage, while a false alarm usually costs only an unnecessary inspection.
Therefore, recall becomes more important than precision.
F1 Score balances precision and recall.
F1 = 2 × (Precision × Recall) / (Precision + Recall)
It is useful when the classes are imbalanced and both false alarms and missed detections matter.
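A short sketch computing precision, recall, and F1 with scikit-learn on the same kind of hypothetical labels:

```python
from sklearn.metrics import precision_score, recall_score, f1_score

# Hypothetical labels: 1 = failure, 0 = normal
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

precision = precision_score(y_true, y_pred)  # TP / (TP + FP)
recall = recall_score(y_true, y_pred)        # TP / (TP + FN)
f1 = f1_score(y_true, y_pred)                # harmonic mean of precision and recall

print(f"Precision = {precision:.2f}, Recall = {recall:.2f}, F1 = {f1:.2f}")
```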
The ROC curve evaluates classifier performance across all decision thresholds; the area under the curve (AUC) summarizes it in a single number.
| AUC | Meaning |
|---|---|
| 1.0 | Perfect classifier |
| 0.9 | Excellent |
| 0.8 | Good |
| 0.7 | Acceptable |
| 0.5 | Random guessing |
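A minimal AUC sketch with scikit-learn; the scores are hypothetical predicted failure probabilities:

```python
from sklearn.metrics import roc_auc_score

# Hypothetical true labels (1 = failure) and predicted failure probabilities
y_true = [0, 0, 1, 1, 0, 1, 0, 1]
y_scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.3, 0.9]

auc = roc_auc_score(y_true, y_scores)
print(f"AUC = {auc:.2f}")  # 1.0 = perfect ranking, 0.5 = random guessing
```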
Understanding the bias-variance trade-off is also fundamental: underfitting reflects high bias, while overfitting reflects high variance.
Cross-validation improves evaluation reliability.
Instead of relying on a single train-test split, the data is divided into k folds; the model is trained on k-1 folds and validated on the remaining fold, rotating until every fold has served as the validation set.
Benefits include a more reliable performance estimate, better use of limited data, and a view of how much performance varies across folds.
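A compact 5-fold cross-validation sketch with scikit-learn; the synthetic dataset and Ridge estimator are placeholders for a real engineering problem:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Synthetic regression data standing in for a real engineering dataset
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# Train on 4 folds, validate on the remaining fold, rotating through all 5 folds
scores = cross_val_score(Ridge(), X, y, cv=5, scoring="neg_root_mean_squared_error")
rmse_per_fold = -scores

print("RMSE per fold:", rmse_per_fold.round(2))
print("Mean RMSE    :", rmse_per_fold.mean().round(2), "±", rmse_per_fold.std().round(2))
```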
In engineering applications, the best model is not always the one with the best score.
A practical engineering model should also be interpretable, stable under small changes in the input data, and consistent with physical expectations.
A slightly lower accuracy model with stable behavior is often preferred over an unstable high-accuracy model.
| Mistake | Consequence |
|---|---|
| Using only accuracy | Misleading conclusions |
| Ignoring train-test comparison | Hidden overfitting |
| Ignoring outliers | Unsafe predictions |
| Using R² alone | False confidence |
| Ignoring residual plots | Missed nonlinear behavior |
| Evaluating on training data only | Unrealistic performance |
| Ignoring data imbalance | Biased classification |
Model evaluation is not simply about obtaining a high score.
The true objective is understanding how the model behaves, where it fails, and whether it can be trusted on unseen data.
A strong engineer does not only train models.
A strong engineer critically analyzes model behavior, limitations, uncertainties, and reliability before deployment.