Computer Vision (CV) is the field of AI that enables machines to understand and interpret images and videos.
Typical goals include recognizing objects, detecting faces, reading text, and understanding scene structure.
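Digital images can be represented as tensors: a grayscale image is a 2D array of shape (height, width), and a color image adds a channel axis, giving (height, width, channels).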
Each pixel stores intensity values (e.g., 0–255), which become numerical features for our models.
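As a minimal sketch of this representation, the NumPy snippet below builds a small image tensor and turns its intensities into floating-point features. The 4×4 size and the random pixel values are assumptions made purely for illustration; a real image would be loaded from a file.

```python
import numpy as np

# Hypothetical 4x4 RGB image with 8-bit intensities (0-255), filled with random values.
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(4, 4, 3), dtype=np.uint8)

print(image.shape)  # (4, 4, 3): height, width, channels
print(image[0, 0])  # the three channel intensities of the top-left pixel

# Scale to floats in [0, 1], a common preprocessing step before training.
features = image.astype(np.float32) / 255.0

# A simple grayscale version: average the channels, which drops the channel axis.
gray = features.mean(axis=-1)
print(gray.shape)   # (4, 4)
```

Scaling to [0, 1] is one common convention; other pipelines standardize each channel to zero mean and unit variance instead.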
Many modern systems are built on Convolutional Neural Networks (CNNs).
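CNNs exploit two key ideas: local connectivity (each filter looks only at a small neighborhood of pixels) and weight sharing (the same filter is applied at every position).
This makes CNNs far more parameter-efficient than fully connected networks on images, and lets a learned feature be detected wherever it appears.
A typical block: convolution → nonlinearity (e.g., ReLU) → pooling.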
Stacking several blocks builds hierarchical features (edges → corners → textures → object parts → objects).
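To make the lowest level of that hierarchy concrete, here is a minimal NumPy sketch of a single filter sliding over an image. The helper `conv2d_valid`, the toy image, and the hand-picked Prewitt-style edge filter are all illustrative assumptions; in a trained CNN the filter values are learned, and libraries use much faster implementations.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide one kernel over a 2D image (no padding, stride 1).

    As in most deep learning libraries, this is technically cross-correlation:
    the kernel is not flipped before it is applied.
    """
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w), dtype=np.float32)
    for i in range(out_h):
        for j in range(out_w):
            # The same 3x3 weights are reused at every position (weight sharing).
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# Toy 5x5 image: two dark columns on the left, three bright columns on the right.
image = np.array([[0, 0, 1, 1, 1]] * 5, dtype=np.float32)

# Hand-picked filter that responds to dark-to-bright vertical edges.
kernel = np.array([[-1, 0, 1],
                   [-1, 0, 1],
                   [-1, 0, 1]], dtype=np.float32)

response = conv2d_valid(image, kernel)
print(response.shape)  # (3, 3)
print(response)        # every row is [3, 3, 0]: strong where the edge is, zero on the flat region
```

The filter has only 9 weights regardless of image size, which is where the parameter savings over a fully connected layer come from.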
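Good performance in CV depends heavily on data: how much labeled data is available, how it is preprocessed, and how it is augmented (e.g., with random flips, crops, and small rotations).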
Augmentation helps networks generalize better and reduces overfitting, especially when labeled data are limited.
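A minimal sketch of what augmentation can look like, written in plain NumPy so each step is visible. The function name `augment`, the 4-pixel padding, and the 50% flip probability are illustrative assumptions, not settings taken from the text; Keras also ships ready-made preprocessing layers such as `layers.RandomFlip` and `layers.RandomRotation` for the same purpose.

```python
import numpy as np

def augment(image, rng, pad=4):
    """Randomly flip an (H, W, C) image horizontally, then take a random crop of the original size."""
    h, w, _ = image.shape
    if rng.random() < 0.5:
        image = image[:, ::-1, :]  # horizontal flip
    # Zero-pad the borders, then crop back to (h, w) at a random offset.
    padded = np.pad(image, ((pad, pad), (pad, pad), (0, 0)), mode="constant")
    top = rng.integers(0, 2 * pad + 1)
    left = rng.integers(0, 2 * pad + 1)
    return padded[top:top + h, left:left + w, :]

rng = np.random.default_rng(42)
image = rng.random((32, 32, 3)).astype(np.float32)  # stand-in for a real training image

augmented = augment(image, rng)
print(augmented.shape)  # (32, 32, 3): same shape, slightly different content
```

Applying a fresh random transformation each time an image is drawn means the network rarely sees exactly the same input twice, while the label stays unchanged.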
from tensorflow import keras
from tensorflow.keras import layers

# Example: small CNN for image classification (e.g., CIFAR-10-like)
input_shape = (32, 32, 3)  # height, width, channels
num_classes = 10

model = keras.Sequential([
    keras.Input(shape=input_shape),  # declare the input shape explicitly
    # Block 1: convolution + ReLU, then 2x2 max pooling
    layers.Conv2D(32, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    # Block 2: more filters as the spatial size shrinks
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(128, (3, 3), activation="relu"),
    # Classifier head: flatten the feature maps, then dense layers
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dense(num_classes, activation="softmax"),
])

model.compile(
    optimizer="adam",
    loss="sparse_categorical_crossentropy",  # expects integer class labels
    metrics=["accuracy"],
)
model.summary()

# Suppose (X_train, y_train) and (X_test, y_test) are preprocessed image tensors
# model.fit(X_train, y_train, epochs=10, batch_size=64, validation_split=0.1)
This defines a basic CNN suitable for small image classification tasks, using convolution, pooling, and dense layers.
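The shapes and parameter counts that `model.summary()` reports for this network can be checked by hand. The sketch below redoes that arithmetic in plain Python, assuming the Keras defaults used above (3×3 kernels, "valid" padding, stride 1, non-overlapping 2×2 pooling); the helper names are hypothetical and exist only for this illustration.

```python
def conv_out(size, kernel=3):
    return size - kernel + 1  # 'valid' padding, stride 1

def pool_out(size, pool=2):
    return size // pool       # non-overlapping pooling, rounded down

def conv_params(in_ch, out_ch, kernel=3):
    return kernel * kernel * in_ch * out_ch + out_ch  # weights + biases

def dense_params(in_units, out_units):
    return in_units * out_units + out_units           # weights + biases

size, channels, total = 32, 3, 0
# (filters, followed_by_pooling) for the three convolutional layers
for filters, pooled in [(32, True), (64, True), (128, False)]:
    total += conv_params(channels, filters)
    size, channels = conv_out(size), filters
    if pooled:
        size = pool_out(size)
    print(f"feature map: {size}x{size}x{channels}")
# feature map: 15x15x32
# feature map: 6x6x64
# feature map: 4x4x128

flat = size * size * channels
total += dense_params(flat, 128) + dense_params(128, 10)
print(flat, total)  # 2048 356810
```

Most of the 356,810 parameters (262,272 of them) sit in the first dense layer, not in the convolutions, which illustrates how few weights the convolutional layers need.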