Basics of Computer Vision

What is Computer Vision?

Computer Vision (CV) is the field of AI that enables machines to understand and interpret images and videos.
Typical goals include recognizing objects, detecting faces, reading text, and understanding scene structure.

Digital images can be seen as tensors:

  • Grayscale image: height × width
  • RGB image: height × width × 3 (channels)

Each pixel stores intensity values (e.g., 0–255), which become numerical features for our models.
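The shapes above can be checked directly with NumPy (the 480×640 size is just an illustrative choice):

```python
import numpy as np

# A grayscale image is a 2-D array: height x width
gray = np.zeros((480, 640), dtype=np.uint8)   # all-black image
print(gray.shape)        # (480, 640)

# An RGB image adds a channel axis: height x width x 3
rgb = np.zeros((480, 640, 3), dtype=np.uint8)
rgb[:, :, 0] = 255       # set the red channel to full intensity
print(rgb.shape)         # (480, 640, 3)
print(rgb[0, 0])         # [255 0 0] -> a pure red pixel
```

These uint8 arrays, holding values in 0–255, are exactly the numerical features a model consumes.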


Core Tasks in Computer Vision

  • Image Classification: assign a label to the entire image (cat vs. dog).
  • Object Detection: locate and classify multiple objects with bounding boxes.
  • Semantic Segmentation: classify each pixel (e.g., road, car, pedestrian).
  • Instance Segmentation: segment individual object instances.
  • Keypoint / Pose Estimation: detect human joints, facial landmarks, etc.

Many modern systems are built on Convolutional Neural Networks (CNNs).


Intuition for Convolutional Neural Networks (CNNs)

CNNs exploit two key ideas:

  1. Local receptive fields: neurons look at small patches of the image (e.g., 3×3, 5×5).
  2. Weight sharing: the same filter (kernel) is slid across the image, detecting the same pattern everywhere.

This makes CNNs:

  • Translation‑equivariant (features move with the object).
  • Parameter‑efficient compared to fully connected networks on large images.

A typical block:

  1. Convolution layer → learns filters for edges, textures, shapes.
  2. Nonlinearity (ReLU) → zeroes out negative activations, letting the stack model non‑linear patterns.
  3. Pooling (e.g., max pooling) → downsamples, making features more robust to small shifts.

Stacking several blocks builds hierarchical features (edges → corners → textures → object parts → objects).
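The block above (convolution → ReLU → max pooling) can be sketched in plain NumPy. The 6×6 toy image and the hand-made vertical-edge kernel are illustrative choices; note that deep-learning "convolution" is technically cross-correlation, as implemented here:

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2-D convolution (cross-correlation, as in CNNs): the same
    kernel is slid across every position (weight sharing), and each
    output value depends only on a small patch (local receptive field)."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def relu(x):
    """Zero out negative activations."""
    return np.maximum(x, 0)

def max_pool(x, size=2):
    """Non-overlapping max pooling: keep the largest value per block."""
    h, w = x.shape
    h, w = h - h % size, w - w % size          # trim to a multiple of `size`
    return x[:h, :w].reshape(h // size, size, w // size, size).max(axis=(1, 3))

# Toy input: dark on the left half, bright on the right (a vertical edge)
img = np.zeros((6, 6))
img[:, 3:] = 1.0

# A hand-made vertical-edge detector
kernel = np.array([[-1, 0, 1],
                   [-1, 0, 1],
                   [-1, 0, 1]], dtype=float)

features = relu(conv2d(img, kernel))   # conv -> nonlinearity
pooled = max_pool(features)            # downsample
print(features.shape, pooled.shape)    # (4, 4) (2, 2)
```

The feature map responds only where the kernel's pattern (a dark-to-bright transition) occurs, and pooling keeps that response while shrinking the map, which is what makes the features robust to small shifts.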


Data Preparation and Augmentation

Good performance in CV depends heavily on data:

  • Normalization: scale pixel values (e.g., to [0, 1] or mean‑zero, unit variance).
  • Resizing / cropping: ensure consistent input size.
  • Data augmentation:
    • Random flips, rotations, small translations.
    • Brightness / contrast jitter.
    • Random crops and zooms.

Augmentation helps networks generalize better and reduces overfitting, especially when labeled data are limited.
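A minimal NumPy sketch of the preprocessing steps above; the flip probability, shift range, and brightness factors are illustrative choices, not fixed recipes:

```python
import numpy as np

rng = np.random.default_rng(0)

def normalize(img):
    """Scale uint8 pixel values from [0, 255] to [0, 1]."""
    return img.astype(np.float32) / 255.0

def augment(img):
    """Apply a few cheap augmentations to one H x W x C image."""
    if rng.random() < 0.5:                 # random horizontal flip
        img = img[:, ::-1, :]
    shift = int(rng.integers(-2, 3))       # small horizontal translation
    img = np.roll(img, shift, axis=1)
    factor = rng.uniform(0.8, 1.2)         # brightness jitter
    return np.clip(img * factor, 0.0, 1.0)

img = rng.integers(0, 256, size=(32, 32, 3), dtype=np.uint8)
aug = augment(normalize(img))
print(aug.shape)    # (32, 32, 3) -- augmentation preserves the input size
```

In practice such transforms are applied on the fly during training, so the network sees a slightly different version of each image every epoch.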


Simple CNN Example in Python (Keras)

from tensorflow import keras
from tensorflow.keras import layers

# Example: small CNN for image classification (e.g., CIFAR-10-like)
input_shape = (32, 32, 3)  # height, width, channels
num_classes = 10

model = keras.Sequential([
    keras.Input(shape=input_shape),
    layers.Conv2D(32, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),

    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),

    layers.Conv2D(128, (3, 3), activation="relu"),
    layers.Flatten(),

    layers.Dense(128, activation="relu"),
    layers.Dense(num_classes, activation="softmax")
])

model.compile(
    optimizer="adam",
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"]
)

model.summary()

# Suppose (X_train, y_train) and (X_test, y_test) are preprocessed image tensors
# model.fit(X_train, y_train, epochs=10, batch_size=64, validation_split=0.1)

This defines a basic CNN suitable for small image classification tasks, using convolution, pooling, and dense layers.
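As a sanity check, the spatial sizes reported by model.summary() can be traced by hand: a "valid" 3×3 convolution shrinks each side by 2, and 2×2 max pooling halves it (floor division).

```python
# Trace the feature-map sizes through the model above.
def conv(s):
    return s - 2        # valid 3x3 convolution

def pool(s):
    return s // 2       # 2x2 max pooling

s = 32
for op in (conv, pool, conv, pool, conv):   # the layer order in the model
    s = op(s)
print(s)              # spatial size entering Flatten: 4
print(s * s * 128)    # flattened feature count: 4 * 4 * 128 = 2048
```

So the Flatten layer feeds a 2048-dimensional vector into the first Dense layer, which is where most of this model's parameters live.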