Reading Notes: ImageNet Classification with Deep Convolutional Neural Networks

lzhangstat
5 min read · Dec 29, 2024


The original paper link is here. This reading note will specifically focus on the model architecture.

A large, deep convolutional neural network was trained to classify the 1.2 million high-resolution images from the ImageNet LSVRC-2010 competition into 1000 distinct classes. On the test data, it achieved top-1 and top-5 error rates of 37.5% and 17.0% respectively, significantly outperforming the previous state of the art. The network comprises 60 million parameters and 650,000 neurons, featuring five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax. The use of ReLU activations and dropout regularization proved highly effective.

The Dataset

ImageNet is a dataset containing over 15 million labeled high-resolution images spread across approximately 22,000 categories. On ImageNet, two error rates are typically reported: top-1 and top-5. The top-5 error rate measures the fraction of test images for which the correct label is not among the five labels deemed most probable by the model.
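To make the metric concrete, here is a minimal NumPy sketch of a top-k error computation (the function name and toy data are illustrative, not from the paper):

```python
import numpy as np

def top_k_error(scores, labels, k=5):
    """Fraction of examples whose true label is not among the k highest-scoring classes."""
    # indices of the k largest scores in each row (order within the top k is irrelevant)
    top_k = np.argpartition(scores, -k, axis=1)[:, -k:]
    hits = (top_k == labels[:, None]).any(axis=1)
    return 1.0 - hits.mean()

# toy usage: 4 examples, 10 classes
rng = np.random.default_rng(0)
scores = rng.normal(size=(4, 10))
labels = np.array([3, 7, 1, 9])
print(top_k_error(scores, labels, k=5))  # top-5 error on the toy scores
```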

Since ImageNet includes images of varying resolutions while the network requires a constant input dimensionality, we downsampled all images to a fixed resolution of 256×256. For rectangular images, we first rescaled them so that the shorter side measured 256 pixels and then cropped the central 256×256 patch. Other than subtracting the mean activity over the training set from each pixel, we did not apply any additional preprocessing.
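A minimal sketch of this preprocessing, assuming Pillow and NumPy (the function name and the commented-out mean computation are illustrative):

```python
from PIL import Image
import numpy as np

def preprocess(path, size=256):
    """Rescale the shorter side to `size`, then crop the central size x size patch."""
    img = Image.open(path).convert("RGB")
    w, h = img.size
    scale = size / min(w, h)
    img = img.resize((round(w * scale), round(h * scale)), Image.BILINEAR)
    w, h = img.size
    left, top = (w - size) // 2, (h - size) // 2
    patch = img.crop((left, top, left + size, top + size))
    return np.asarray(patch, dtype=np.float32)

# The per-pixel mean is computed once over the training set and subtracted from every image:
# train_mean = np.mean([preprocess(p) for p in training_paths], axis=0)
# x = preprocess("example.jpg") - train_mean
```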

The Architecture

The model contains eight learned layers — five convolutional and three fully-connected.

ReLU Nonlinearity

Deep convolutional neural networks utilizing ReLUs achieve training speeds several times faster than those employing tanh units. This accelerated learning significantly enhances the performance of large models when applied to extensive datasets.
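For reference, the two activation functions being compared are

f(x) = \max(0, x) \quad \text{(ReLU)} \qquad \text{versus} \qquad f(x) = \tanh(x) \quad \text{(saturating)}.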

Training on Multiple GPUs

The network is distributed across two GPUs. The parallelization strategy places half of the kernels (or neurons) on each GPU, with one additional trick: inter-GPU communication occurs only at specific layers. For instance, kernels in layer 3 receive input from all kernel maps in layer 2, whereas kernels in layer 4 receive input only from the kernel maps in layer 3 that reside on the same GPU. Choosing the connectivity pattern is a problem for cross-validation, but this scheme allows us to tune the amount of communication until it is an acceptable fraction of the amount of computation.
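On a single device this restricted connectivity is commonly emulated with grouped convolutions; a PyTorch sketch (PyTorch is an assumption here, and the channel counts follow the paper's layers 3 and 4):

```python
import torch
import torch.nn as nn

# Layer-3 style: every output kernel sees all input kernel maps (groups=1).
conv_full = nn.Conv2d(in_channels=256, out_channels=384, kernel_size=3, padding=1, groups=1)

# Layer-4 style: kernels are split into two halves that only see the half of the
# input maps residing "on the same GPU" (groups=2 mimics the two-GPU split).
conv_split = nn.Conv2d(in_channels=384, out_channels=384, kernel_size=3, padding=1, groups=2)

x = torch.randn(1, 256, 13, 13)
y3 = conv_full(x)    # shape (1, 384, 13, 13)
y4 = conv_split(y3)  # shape (1, 384, 13, 13)
print(y3.shape, y4.shape)
```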

Local Response Normalization

ReLUs have the desirable property of not needing input normalization to avoid saturation. Nonetheless, the following local normalization scheme still enhances generalization.

Here a^i_{x,y} denotes the activity of a neuron computed by applying kernel i at position (x, y) and then applying the ReLU nonlinearity, and b^i_{x,y} is the corresponding response-normalized activity.
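The normalized activity is then

b^{i}_{x,y} = \frac{a^{i}_{x,y}}{\left(k + \alpha \sum_{j=\max(0,\, i-n/2)}^{\min(N-1,\, i+n/2)} \big(a^{j}_{x,y}\big)^{2}\right)^{\beta}}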

The sum in the denominator of the response-normalized activity b runs over n “adjacent” kernel maps at the same spatial position, with N being the total number of kernels in the layer. The ordering of the kernel maps is arbitrary and fixed before training. The constants k, n, α, and β are hyperparameters whose values are determined using a validation set; the paper uses k = 2, n = 5, α = 10⁻⁴, and β = 0.75. This normalization is applied after the ReLU nonlinearity in certain layers.

Overlapping Pooling

Pooling layers in CNNs aggregate the outputs from neighboring neuron groups within the same kernel map. Traditionally, the regions summarized by adjacent pooling units do not overlap. Specifically, a pooling layer can be visualized as a grid of pooling units spaced s pixels apart, each summarizing a neighborhood of size z*z centered at the location of the pooling unit. Setting s = z results in traditional local pooling commonly used in CNNs. Setting s < z results in overlapping pooling. Throughout our network, we employ overlapping pooling with s = 2 and z = 3. During training, we typically observe that models with overlapping pooling are slightly less prone to overfitting.
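A quick PyTorch illustration (PyTorch assumed; the 55×55 input matches the spatial size entering the first pooling layer): both settings below roughly halve the resolution, but with z = 3 and s = 2 the pooled neighborhoods overlap.

```python
import torch
import torch.nn as nn

pool_traditional = nn.MaxPool2d(kernel_size=2, stride=2)  # s = z: non-overlapping
pool_overlap = nn.MaxPool2d(kernel_size=3, stride=2)      # s < z: overlapping, as in the network

x = torch.randn(1, 96, 55, 55)
print(pool_traditional(x).shape)  # torch.Size([1, 96, 27, 27])
print(pool_overlap(x).shape)      # torch.Size([1, 96, 27, 27])
```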

Overall Architecture

The network comprises eight layers; the first five are convolutional and the last three are fully connected. The final fully-connected layer’s output feeds a 1000-way softmax, generating a probability distribution over the 1000 class labels. Training maximizes the multinomial logistic regression objective, which equates to maximizing the average log-probability of the correct label across training cases under the prediction distribution.

As discussed in the context of GPU parallelization, kernels in certain layers (such as layer 4) receive input exclusively from the kernel maps in the preceding layer (e.g., layer 3) that are located on the same GPU.
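Putting the pieces together, here is a single-device PyTorch sketch of the eight learned layers, with groups=2 standing in for the two-GPU connectivity restriction (the class name is illustrative, and the 227×227 input is the common convention that makes the spatial arithmetic exact; the paper states 224×224):

```python
import torch
import torch.nn as nn

class AlexNetSketch(nn.Module):
    """Eight learned layers: five convolutional, three fully connected."""
    def __init__(self, num_classes=1000):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 96, kernel_size=11, stride=4), nn.ReLU(inplace=True),
            nn.LocalResponseNorm(size=5, alpha=1e-4, beta=0.75, k=2.0),
            nn.MaxPool2d(kernel_size=3, stride=2),
            nn.Conv2d(96, 256, kernel_size=5, padding=2, groups=2), nn.ReLU(inplace=True),
            nn.LocalResponseNorm(size=5, alpha=1e-4, beta=0.75, k=2.0),
            nn.MaxPool2d(kernel_size=3, stride=2),
            nn.Conv2d(256, 384, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(384, 384, kernel_size=3, padding=1, groups=2), nn.ReLU(inplace=True),
            nn.Conv2d(384, 256, kernel_size=3, padding=1, groups=2), nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),
        )
        self.classifier = nn.Sequential(
            nn.Dropout(0.5), nn.Linear(256 * 6 * 6, 4096), nn.ReLU(inplace=True),
            nn.Dropout(0.5), nn.Linear(4096, 4096), nn.ReLU(inplace=True),
            nn.Linear(4096, num_classes),
        )

    def forward(self, x):
        x = self.features(x)
        x = torch.flatten(x, 1)
        return self.classifier(x)

model = AlexNetSketch()
logits = model(torch.randn(1, 3, 227, 227))
print(logits.shape)  # torch.Size([1, 1000])
```

Applying nn.CrossEntropyLoss to these logits corresponds to the 1000-way softmax followed by the multinomial logistic regression objective described above.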

Reducing Overfitting

Data Augmentation

The simplest and most prevalent technique to reduce overfitting on image data involves artificially expanding the dataset through label-preserving transformations.

The first type of data augmentation consists of generating image translations and horizontal reflections: random 224×224 patches (and their horizontal reflections) are extracted from the 256×256 images, and the network is trained on these patches.
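A minimal NumPy sketch of this first augmentation (the function name is illustrative):

```python
import numpy as np

def random_crop_and_flip(img, crop=224, rng=None):
    """img: (256, 256, 3) array. Returns a random crop, horizontally flipped half the time."""
    if rng is None:
        rng = np.random.default_rng()
    h, w, _ = img.shape
    top = rng.integers(0, h - crop + 1)
    left = rng.integers(0, w - crop + 1)
    patch = img[top:top + crop, left:left + crop]
    if rng.random() < 0.5:
        patch = patch[:, ::-1]  # horizontal reflection
    return patch
```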

The second type involves altering the intensities of the RGB channels in training images: PCA is performed on the set of RGB pixel values across the training set, and multiples of the principal components, scaled by the corresponding eigenvalues times a Gaussian random variable, are added to each image.
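A sketch of this second augmentation, assuming NumPy (the function name is illustrative; the paper draws each coefficient from a Gaussian with mean zero and standard deviation 0.1):

```python
import numpy as np

def pca_color_jitter(img, eigvecs, eigvals, sigma=0.1, rng=None):
    """Add the same random RGB offset, built from the principal components, to every pixel.

    img: (H, W, 3) float array.
    eigvecs: (3, 3) matrix whose columns are the principal components of the RGB
             pixel covariance over the training set; eigvals: (3,) eigenvalues.
    """
    if rng is None:
        rng = np.random.default_rng()
    alphas = rng.normal(0.0, sigma, size=3)  # one draw per image presentation
    shift = eigvecs @ (alphas * eigvals)     # a single 3-vector
    return img + shift

# The PCA itself is computed once from all RGB pixel values in the training set, e.g.:
# pixels = train_images.reshape(-1, 3)
# eigvals, eigvecs = np.linalg.eigh(np.cov(pixels, rowvar=False))
```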

Dropout

Dropout involves zeroing out the output of each hidden neuron with a probability of 0.5. Neurons ‘dropped out’ in this manner do not contribute to the forward pass and are excluded from back-propagation. Consequently, each time an input is presented, the neural network samples a different architecture, while sharing the same weights. This technique mitigates complex co-adaptations among neurons, as a neuron cannot depend on the presence of specific other neurons.
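A quick PyTorch illustration (note that PyTorch's nn.Dropout uses "inverted" dropout, scaling surviving activations by 1/(1-p) during training rather than halving the outputs at test time as in the paper; the end effect is equivalent):

```python
import torch
import torch.nn as nn

drop = nn.Dropout(p=0.5)  # zero each activation with probability 0.5

x = torch.ones(1, 8)
drop.train()
print(drop(x))  # roughly half the entries are zeroed; survivors are scaled by 2
drop.eval()
print(drop(x))  # identity at test time
```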

However, dropout approximately doubles the number of iterations needed for convergence.

Details of learning

The model was trained using stochastic gradient descent with a batch size of 128 examples, momentum of 0.9, and weight decay of 0.0005. This small amount of weight decay was important for the model to learn; here, weight decay acts not only as a regularizer but also reduces the model’s training error. The update rule for weight w was as follows.
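The rule, reconstructed from the paper, is

v_{i+1} = 0.9\, v_i \;-\; 0.0005\, \varepsilon\, w_i \;-\; \varepsilon \left\langle \frac{\partial L}{\partial w} \Big|_{w_i} \right\rangle_{D_i}, \qquad w_{i+1} = w_i + v_{i+1},

where i indexes the iteration, v is the momentum variable, \varepsilon is the learning rate, and the angle brackets denote the average over the i-th batch D_i of the derivative of the objective with respect to w, evaluated at w_i.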

Results

The network achieves top-1 and top-5 test error rates of 37.5% and 17.0% on ILSVRC-2010.
