see neural networks, convolution

  • Local Connectivity: Each output depends only on a small input patch (reduces parameters).
  • Parameter Sharing: The same filter scans the entire image (efficient computation).
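A quick count makes the savings concrete (a sketch, using the 32×32×3 input and 5×5×3 filter from the running example below):

```python
# Fully connected: every output unit sees every input value.
dense_weights = (32 * 32 * 3) * (28 * 28)   # 2,408,448 weights for one 28x28 output map
# Convolutional: one shared 5x5x3 filter reused at every position.
conv_weights = 5 * 5 * 3                    # 75 weights (plus 1 bias)
print(dense_weights, conv_weights)          # 2408448 75
```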

operations

  1. Convolution: apply learned filters to extract feature maps
    • Example: A 5×5×3 filter applied to a 32×32×3 input produces a 28×28×1 output (no padding, stride=1).
  2. Stride and padding
    • Stride: Controls how much the filter shifts (reduces output size).
    • Padding: Adds zeros to maintain spatial dimensions.
  3. Pooling (downsampling)
    • reduces spatial size while preserving the important features, e.g. max pooling: take the max of each K×K window, slid with stride S (all three operations are shape-checked in the sketch after this list)
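All three operations are easy to sanity-check by output shape; a minimal sketch, assuming PyTorch is available:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 32, 32)   # one 32x32 RGB image (NCHW layout)

# 1. convolution: a single 5x5 filter, no padding, stride 1 -> 28x28 (as above)
conv = nn.Conv2d(in_channels=3, out_channels=1, kernel_size=5)
print(conv(x).shape)            # torch.Size([1, 1, 28, 28])

# 2. padding=2 preserves the 32x32 spatial size; stride=2 roughly halves it
conv_same = nn.Conv2d(3, 1, kernel_size=5, padding=2)
print(conv_same(x).shape)       # torch.Size([1, 1, 32, 32])
conv_s2 = nn.Conv2d(3, 1, kernel_size=5, stride=2)
print(conv_s2(x).shape)         # torch.Size([1, 1, 14, 14])

# 3. max pooling with K=2, S=2 halves each spatial dimension
pool = nn.MaxPool2d(kernel_size=2, stride=2)
print(pool(x).shape)            # torch.Size([1, 3, 16, 16])
```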

architecture

  • hierarchical feature learning
    • early layers do edge detection
    • middle layers do corner and texture detection
    • deeper layers capture semantic meaning
  • Example: AlexNet (2012), the first CNN to outperform traditional methods by a large margin.

conv layer needs 4 hyperparams

  • num filters F (output channels)
  • filter size K
  • stride S
  • zero padding P
  • output size: for a W×W input, each spatial dimension becomes (W − K + 2P)/S + 1, and the output depth is F
  • num params: F × (K×K×D_in + 1), i.e. K×K×D_in weights plus 1 bias per filter (both formulas are sketched as helpers after this section)

  • convolve each filter with the image, computing dot products; filters always extend the full depth of the input volume
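Both formulas are one-liners; a minimal sketch (the helper names are mine):

```python
def conv_output_size(W: int, K: int, S: int = 1, P: int = 0) -> int:
    """Spatial output size of a conv layer: (W - K + 2P) / S + 1."""
    assert (W - K + 2 * P) % S == 0, "filter does not tile the input evenly"
    return (W - K + 2 * P) // S + 1

def conv_param_count(K: int, D_in: int, F: int) -> int:
    """Learnable parameters: K*K*D_in weights plus 1 bias per filter, F filters."""
    return (K * K * D_in + 1) * F

print(conv_output_size(32, 5))        # 28, matching the 32x32x3 example above
print(conv_param_count(5, 3, 10))     # 760 = (5*5*3 + 1) * 10
```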

other

  • param configs
    • Gradually reduce spatial dimensions while increasing channels to balance computation.
    • Example: 224×224×3 → 55×55×48 → 13×13×192 (AlexNet's per-GPU channel counts; see the sketch after this list).
  • Automatically learn hierarchical representations, similar to human vision.
  • Replace handcrafted features (e.g., HOG, SIFT) with data-driven filters.
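The shrink-spatial / grow-channels pattern is visible in an AlexNet-style stem. A sketch assuming PyTorch; a 227×227 input makes the arithmetic work out (the paper states 224, a well-known off-by-three), and the 48/192 figures above are per-GPU halves of the original two-GPU split:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 227, 227)
conv1 = nn.Conv2d(3, 96, kernel_size=11, stride=4)    # (227-11)/4+1 = 55 -> 55x55x96
pool1 = nn.MaxPool2d(kernel_size=3, stride=2)         # (55-3)/2+1  = 27 -> 27x27x96
conv2 = nn.Conv2d(96, 256, kernel_size=5, padding=2)  # spatial size kept: 27x27x256
pool2 = nn.MaxPool2d(kernel_size=3, stride=2)         # (27-3)/2+1  = 13 -> 13x13x256

h = pool2(torch.relu(conv2(pool1(torch.relu(conv1(x))))))
print(h.shape)   # torch.Size([1, 256, 13, 13]): spatial size down, channels up
```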

one filter → one activation map; stacking the maps from all F filters gives the depth-F output volume (naive loop sketched below)
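Both of the last two points (full-depth dot products, one map per filter) fall out of a naive NumPy implementation; a sketch for clarity, not efficiency:

```python
import numpy as np

def conv2d_naive(x, filters, biases):
    """x: (H, W, D); filters: (F, K, K, D); returns (H-K+1, W-K+1, F)."""
    H, W, D = x.shape
    F, K, _, _ = filters.shape
    out = np.zeros((H - K + 1, W - K + 1, F))
    for f in range(F):                          # one filter -> one activation map
        for i in range(H - K + 1):
            for j in range(W - K + 1):
                patch = x[i:i + K, j:j + K, :]  # patch spans the full depth D
                out[i, j, f] = np.sum(patch * filters[f]) + biases[f]
    return out

x = np.random.randn(32, 32, 3)
w = np.random.randn(10, 5, 5, 3)    # 10 filters of size 5x5x3
b = np.random.randn(10)
print(conv2d_naive(x, w, b).shape)  # (28, 28, 10): one 28x28 map per filter
```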