Neural Networks

Intuition for Neural Networks

Neural networks are function approximators inspired (loosely) by the brain.
They take an input vector (x), apply a sequence of linear transformations and nonlinear activations, and output a prediction (\hat{y}).

At a high level:

[ x \rightarrow \text{(Linear)} \rightarrow \text{(Nonlinearity)} \rightarrow \dots \rightarrow \text{Output} ]

By stacking many such layers, networks can model very complex, highly nonlinear relationships in data.


Perceptron and Neuron

The basic building block is a neuron (or perceptron):

[ z = w^\top x + b,\quad a = \sigma(z) ]

where:

  • (x) is the input vector
  • (w) is the weight vector
  • (b) is the bias
  • (\sigma(\cdot)) is an activation function (e.g., ReLU, sigmoid, tanh)
  • (a) is the neuron output

Common activations:

  • ReLU: (\text{ReLU}(z) = \max(0, z))
  • Sigmoid: (\sigma(z) = \frac{1}{1 + e^{-z}})
  • Tanh: (\tanh(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}})
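The neuron equation and these activations can be sketched in a few lines of NumPy (an illustrative sketch, not part of any particular library; the names `neuron`, `relu`, and `sigmoid` are just for this example):

```python
import numpy as np

def relu(z):
    # ReLU: max(0, z), applied element-wise
    return np.maximum(0.0, z)

def sigmoid(z):
    # Sigmoid: squashes z into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def neuron(x, w, b, activation=relu):
    # z = w^T x + b, then a = sigma(z); tanh is available as np.tanh
    z = w @ x + b
    return activation(z)

x = np.array([1.0, -2.0, 0.5])
w = np.array([0.3, 0.1, -0.4])
b = 0.2
a = neuron(x, w, b, activation=sigmoid)
```

Here z = 0.3·1 + 0.1·(−2) − 0.4·0.5 + 0.2 = 0.1, so the sigmoid output lands just above 0.5.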

Feedforward Neural Network (MLP)

A simple fully connected network has:

  1. Input layer: holds the features (no parameters).
  2. Hidden layers: one or more layers of neurons with nonlinear activations.
  3. Output layer: produces the final prediction.

Example with one hidden layer:

[ h = \sigma(W_1 x + b_1),\quad \hat{y} = f(W_2 h + b_2) ]

where (f) is usually:

  • Identity (for regression)
  • Sigmoid (for binary classification)
  • Softmax (for multi‑class classification)
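The one-hidden-layer forward pass above can be written directly in NumPy. This is a hedged sketch with made-up shapes (4 inputs, 8 hidden units, 3 classes) and a softmax output for the multi-class case:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def softmax(z):
    # Subtract the max for numerical stability before exponentiating
    e = np.exp(z - z.max())
    return e / e.sum()

def forward(x, W1, b1, W2, b2):
    # h = sigma(W1 x + b1), y_hat = f(W2 h + b2) with f = softmax
    h = relu(W1 @ x + b1)
    return softmax(W2 @ h + b2)

rng = np.random.default_rng(0)
x = rng.normal(size=4)
W1, b1 = rng.normal(size=(8, 4)), np.zeros(8)
W2, b2 = rng.normal(size=(3, 8)), np.zeros(3)
y_hat = forward(x, W1, b1, W2, b2)  # probabilities over 3 classes
```

For regression you would drop the softmax (identity output); for binary classification you would replace it with a sigmoid on a single output unit.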

Training: Loss and Backpropagation

Training a neural network means finding the parameters ({W, b}) — weights and biases — that minimize a loss function on the training data.

Typical losses:

  • Mean Squared Error (MSE) for regression.
  • Binary Cross‑Entropy for binary classification.
  • Categorical Cross‑Entropy for multi‑class classification.
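These three losses are short enough to write out in NumPy (an illustrative sketch; the clipping with `eps` guards against log(0) and is a common implementation detail, not part of the mathematical definition):

```python
import numpy as np

def mse(y_true, y_pred):
    # Mean Squared Error for regression
    return np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)

def binary_cross_entropy(y_true, p, eps=1e-12):
    # y_true in {0, 1}; p is the predicted probability of class 1
    y_true, p = np.asarray(y_true), np.clip(p, eps, 1 - eps)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

def categorical_cross_entropy(y_onehot, p, eps=1e-12):
    # y_onehot: one-hot labels; p: predicted class probabilities (rows sum to 1)
    p = np.clip(p, eps, 1.0)
    return -np.mean(np.sum(np.asarray(y_onehot) * np.log(p), axis=1))
```

A perfect prediction drives each loss to (essentially) zero, which is a quick sanity check when implementing them.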

The key algorithm is backpropagation:

  1. Forward pass: compute predictions (\hat{y}) for a batch of inputs.
  2. Compute loss: compare (\hat{y}) with true labels (y).
  3. Backward pass: propagate gradients of the loss w.r.t. each parameter using the chain rule.
  4. Update parameters: with an optimizer such as Stochastic Gradient Descent (SGD) or Adam.

This process repeats for many epochs until the loss stops improving (or early stopping kicks in).
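The four steps can be made concrete with a manual training loop on a toy regression problem. This is a from-scratch sketch (one hidden layer, MSE loss, plain full-batch gradient descent rather than SGD or Adam; all sizes and the learning rate are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data: y = 2*x0 - 1*x1 (no noise), so the loss can approach 0
X = rng.normal(size=(200, 2))
y = (X @ np.array([2.0, -1.0])).reshape(-1, 1)

# One hidden layer of 16 ReLU units, linear (identity) output
W1 = rng.normal(size=(2, 16)) * 0.5
b1 = np.zeros(16)
W2 = rng.normal(size=(16, 1)) * 0.5
b2 = np.zeros(1)

lr = 0.05
losses = []
for epoch in range(500):
    # 1. Forward pass
    h_pre = X @ W1 + b1
    h = np.maximum(0.0, h_pre)          # ReLU
    y_hat = h @ W2 + b2

    # 2. Compute loss (MSE)
    losses.append(np.mean((y_hat - y) ** 2))

    # 3. Backward pass: chain rule, from the loss back to each parameter
    dy = 2.0 * (y_hat - y) / len(X)     # dL/dy_hat
    dW2 = h.T @ dy
    db2 = dy.sum(axis=0)
    dh = dy @ W2.T
    dh_pre = dh * (h_pre > 0)           # ReLU passes gradient only where z > 0
    dW1 = X.T @ dh_pre
    db1 = dh_pre.sum(axis=0)

    # 4. Update parameters (vanilla gradient descent)
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2
```

Frameworks like TensorFlow and PyTorch automate step 3 (automatic differentiation), but the computation they perform is exactly this pattern.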


Overfitting and Regularization

Neural networks can easily overfit, especially with many parameters. Common regularization techniques:

  • L2 weight decay: penalize large weights in the loss function.
  • Dropout: randomly “drop” (set to zero) a fraction of activations during training.
  • Early stopping: stop training when validation loss stops improving.
  • Batch normalization: normalize activations within a mini‑batch to stabilize training.
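Two of these techniques fit in a few lines of NumPy. The sketch below shows "inverted" dropout (the common variant, which rescales at training time so nothing changes at inference) and an L2 penalty term; the function names and `lam` parameter are just for this example:

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(a, rate, training=True):
    # Inverted dropout: zero out a fraction `rate` of activations during
    # training and rescale the survivors by 1/(1-rate), so the expected
    # activation is unchanged. At inference time it is the identity.
    if not training or rate == 0.0:
        return a
    mask = rng.random(a.shape) >= rate
    return a * mask / (1.0 - rate)

def l2_penalty(weights, lam):
    # L2 weight decay term added to the loss: lam * sum of squared weights
    return lam * sum(np.sum(W ** 2) for W in weights)
```

Early stopping, by contrast, is a training-loop policy rather than a function: monitor validation loss each epoch and halt when it has not improved for a set number of epochs.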

When to Use Neural Networks

Neural networks are especially powerful when:

  • You have large amounts of data.
  • The relationship between inputs and outputs is highly nonlinear.
  • You work with images, text, audio, or time series, where deep architectures (CNNs, RNNs, Transformers) shine.

For small tabular datasets, simpler models (tree ensembles, linear models) are often competitive or better.


Simple Example in Python (Keras)

import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Dummy data: regression example
X = np.random.randn(1000, 10)
y = (X[:, 0] * 2.0 + X[:, 1] * -3.0 + 0.5 * np.random.randn(1000))

model = keras.Sequential([
    keras.Input(shape=(10,)),             # 10 input features
    layers.Dense(32, activation="relu"),
    layers.Dense(16, activation="relu"),
    layers.Dense(1)                       # linear output for regression
])

model.compile(
    optimizer="adam",
    loss="mse",
    metrics=["mae"]
)

model.summary()

model.fit(
    X, y,
    epochs=20,
    batch_size=32,
    validation_split=0.2,
    verbose=1
)

This builds a small feedforward neural network for a toy regression task, trains it, and reports loss/MAE on a validation split.