Dimensionality in Convolutional Neural Networks (CNNs)
In Convolutional Neural Networks (CNNs), understanding the dimensions of the input, filters (kernels), and the output feature maps is crucial for designing and debugging models.
1. Input Dimensions
The input to a CNN is typically an image or a batch of images. For a single image, the dimensions are represented as:
- Height (H): The number of rows in the image.
- Width (W): The number of columns in the image.
- Channels (C): The number of color channels (e.g., 1 for grayscale, 3 for RGB).
For a batch of images, the dimensions are:
- Batch Size (N): The number of images in the batch.
- Height (H): The height of each image.
- Width (W): The width of each image.
- Channels (C): The number of color channels.
For example, an RGB image of size 32x32 pixels would have dimensions (32, 32, 3), and a batch of 10 such images would have dimensions (10, 32, 32, 3).
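To make the shapes above concrete, here is a minimal sketch in plain Python that builds an all-zero image and a batch as nested lists and reads their dimensions back. The helper names `make_image` and `shape_of` are hypothetical and exist only for this illustration; the sketch assumes the same (N, H, W, C) ordering used in the text.

```python
def make_image(height, width, channels):
    """Build an all-zero image as nested lists: rows -> columns -> channel values."""
    return [[[0] * channels for _ in range(width)] for _ in range(height)]


def shape_of(nested):
    """Return the shape of a regularly nested, non-empty list as a tuple."""
    dims = []
    while isinstance(nested, list):
        dims.append(len(nested))
        nested = nested[0]  # descend one level; assumes every level is non-empty
    return tuple(dims)


image = make_image(32, 32, 3)                        # one 32x32 RGB image
batch = [make_image(32, 32, 3) for _ in range(10)]   # 10 such images

print(shape_of(image))  # (32, 32, 3)
print(shape_of(batch))  # (10, 32, 32, 3)
```

In practice an array library would hold this data, but the nesting order is the same idea: the outermost level is the batch, then rows, then columns, then channels.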
2. Filter (Kernel) Dimensions
Filters (or kernels) in CNNs are used to extract features from the input images. The dimensions of a filter are represented as:
- Height (kH): The number of rows in the filter.
- Width (kW): The number of columns in the filter.
- Channels (C): The number of input channels (which must match the number of channels in the input image).
Additionally, you have:
- Number of Filters (F): The number of filters applied, which determines the depth of the output feature map.
For example, with 5 filters of size 3x3, each filter has dimensions (3, 3, C), where C is the number of input channels. For an RGB input (C = 3), each filter has dimensions (3, 3, 3).
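The filter example can be sketched in a few lines of Python. The function names below are hypothetical, the stacked (kH, kW, C, F) ordering is an assumed convention rather than something the text prescribes, and the optional one-bias-per-filter term is a common practice that is also an assumption here.

```python
def filter_bank_shape(kernel_h, kernel_w, in_channels, num_filters):
    """Shape of all filters stacked together, in an assumed (kH, kW, C, F) order."""
    return (kernel_h, kernel_w, in_channels, num_filters)


def count_parameters(kernel_h, kernel_w, in_channels, num_filters, use_bias=True):
    """Number of learnable values in the layer.

    Each filter holds kH * kW * C weights; use_bias adds one bias per filter
    (an assumption, not stated in the text).
    """
    weights = kernel_h * kernel_w * in_channels * num_filters
    biases = num_filters if use_bias else 0
    return weights + biases


bank = filter_bank_shape(3, 3, 3, 5)   # 5 filters of size 3x3 over an RGB input
print(bank)                                         # (3, 3, 3, 5)
print(count_parameters(3, 3, 3, 5, use_bias=False)) # 135
print(count_parameters(3, 3, 3, 5))                 # 140
```

Note that the parameter count depends only on the filter size, the input channels, and the number of filters, not on the height or width of the input image.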
3. Output Dimensions
The output of a convolutional layer is a set of feature maps. The dimensions of the output feature maps depend on the input dimensions, filter dimensions, stride, and padding.
- Height (H_out): The height of the output feature map.
- Width (W_out): The width of the output feature map.
- Depth (D_out): The depth of the output feature map, which is equal to the number of filters.
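The height and width of the output feature map can be calculated using the following formulas:
$$H_{out} = \left\lfloor \frac{H + 2P - kH}{S} \right\rfloor + 1$$
$$W_{out} = \left\lfloor \frac{W + 2P - kW}{S} \right\rfloor + 1$$
Where: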
- H = height of the input.
- W = width of the input.
- kH = height of the filter.
- kW = width of the filter.
- P = padding applied to each side of the input.
- S = stride of the filter.
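As a rough sketch, the output-size calculation for one spatial dimension can be written as a small Python function. The name `conv_output_size` is hypothetical, and the sketch assumes the same padding P is added to both sides of the input and that the result is rounded down, as in the quantities defined above.

```python
def conv_output_size(size, kernel, padding=0, stride=1):
    """Output length along one dimension: floor((size + 2P - kernel) / S) + 1."""
    if stride < 1:
        raise ValueError("stride must be a positive integer")
    # Integer floor division implements the rounding-down step.
    return (size + 2 * padding - kernel) // stride + 1


print(conv_output_size(32, 5))            # 28: no padding shrinks the map
print(conv_output_size(7, 3, stride=2))   # 3: stride 2 roughly halves it
print(conv_output_size(8, 3, stride=2))   # 3: (8 - 3) / 2 = 2.5 is floored to 2
```

The same function is applied twice per layer: once with H and kH for the height, and once with W and kW for the width.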
Let’s consider an example to clarify these concepts:
- Input dimensions: (64, 64, 3) (a 64x64 RGB image).
- Filter dimensions: (3, 3, 3) (a 3x3 filter with 3 channels).
- Number of filters: 16.
- Stride: 1.
- Padding: 1 (to keep the output height and width the same as the input).
Using the formulas:
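$$H_{out} = \left\lfloor \frac{64 + 2 \times 1 - 3}{1} \right\rfloor + 1 = 64$$
$$W_{out} = \left\lfloor \frac{64 + 2 \times 1 - 3}{1} \right\rfloor + 1 = 64$$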
So, the output feature map dimensions will be (64, 64, 16), where 16 is the number of filters.
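One way to double-check this worked example is to count the filter placements directly instead of using the closed-form expression. The sketch below is illustrative only: `window_starts` and `conv_layer_output_shape` are hypothetical names, and it assumes zero padding of P pixels on every side and an (H, W, C) input layout.

```python
def window_starts(size, kernel, padding, stride):
    """Start indices, within the padded input, of every valid filter placement."""
    padded = size + 2 * padding
    return list(range(0, padded - kernel + 1, stride))


def conv_layer_output_shape(input_shape, filter_shape, num_filters, padding, stride):
    """Output shape (H_out, W_out, D_out) for one convolutional layer."""
    h, w, c = input_shape
    kh, kw, kc = filter_shape
    if kc != c:
        raise ValueError("filter channels must match input channels")
    # One output value per placement of the filter along each axis.
    h_out = len(window_starts(h, kh, padding, stride))
    w_out = len(window_starts(w, kw, padding, stride))
    return (h_out, w_out, num_filters)


shape = conv_layer_output_shape((64, 64, 3), (3, 3, 3), 16, padding=1, stride=1)
print(shape)  # (64, 64, 16)
```

With padding 1 the padded input is 66 pixels wide, so a 3-wide filter can start at positions 0 through 63, which gives the 64 output columns computed above.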