Reading Notes: Deep Residual Learning for Image Recognition

lzhangstat
Dec 30, 2024

The original paper is linked here. This reading note focuses specifically on the model architecture.

We explicitly redesign the layers to learn residual functions relative to the layer inputs, rather than learning functions without reference. We offer extensive empirical evidence demonstrating that these residual networks are easier to optimize and can achieve higher accuracy with substantially increased depth.

Introduction

Training deeper neural networks poses several challenges. One significant hurdle in adding more layers is the vanishing/exploding gradients problem, which impedes convergence from the outset. The issue has been largely mitigated by using normalized initialization and intermediate normalization layers, allowing networks with tens of layers to begin converging under stochastic gradient descent (SGD) with back-propagation.

However, once deeper networks begin to converge, another issue arises: with increasing depth, the accuracy initially improves but eventually saturates and then declines sharply. Interestingly, this degradation is not due to overfitting. Instead, adding more layers to an already deep model results in higher training errors.

The degradation in training accuracy suggests that not all systems are equally easy to optimize. In this paper, we tackle the degradation issue by introducing a deep residual learning framework. By denoting the desired mapping from input to output as H(x), we allow the stacked nonlinear layers to fit the residual function F(x):=H(x)-x instead of H(x) directly. We hypothesize that optimizing the residual function F(x) is easier than optimizing the original mapping H(x). In the extreme case where an identity mapping is optimal, it would be more straightforward to push the residual to zero than to fit an identity mapping with a stack of nonlinear layers.

The formulation F(x)+x can be implemented through feedforward neural networks with identity shortcut connections. These connections do not introduce any additional parameters or computational complexity. The entire network can still be trained end-to-end using SGD with backpropagation and can be easily implemented using standard libraries without modifying the solvers.
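As an illustration, here is a minimal PyTorch sketch of such a block. The class name and the specific two-convolution layout are my own simplification for exposition, not code from the paper.

```python
import torch.nn as nn

class BasicResidualBlock(nn.Module):
    """Two stacked 3x3 convolutions F(x) combined with an identity shortcut: y = F(x) + x."""

    def __init__(self, channels):
        super().__init__()
        # F(x): conv -> BN -> ReLU -> conv -> BN (batch norm after each conv, before activation)
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        # The identity shortcut adds no parameters and essentially no extra
        # computation beyond the element-wise addition.
        return self.relu(out + x)
```

For example, BasicResidualBlock(64) maps a 64-channel feature map to a 64-channel feature map of the same spatial size, so the addition is well defined.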

We conduct extensive experiments on ImageNet to demonstrate the degradation problem and assess our approach. Our findings reveal that: 1) our extremely deep residual networks are easy to optimize, whereas their ‘plain’ counterparts (which simply stack layers) exhibit higher training error as depth increases; 2) our deep residual networks benefit significantly from increased depth, achieving substantially better accuracy than previous networks.

Similar results are observed on the CIFAR-10 dataset, indicating that the optimization challenges and the effects of our methods are not unique to a specific dataset.

Deep Residual Learning

Residual Learning

By denoting the desired mapping from input to output as H(x), we configure the stacked nonlinear layers to fit F(x):= H(x)-x rather than H(x) directly, assuming the input and output have the same dimensions. The learning process might differ between fitting H(x) directly and fitting F(x)+x.

If the optimal function is closer to an identity mapping rather than a zero mapping, it is likely easier for the solver to identify perturbations relative to an identity mapping than to learn the function from scratch.

Identity Mapping by Shortcuts

Formally, in this paper, we define a building block as follows:

y = F(x, {W_i}) + x

Here, x and y represent the input and output vectors of the layers under consideration. The function F(x, {W_i}) represents the residual mapping to be learned.

The dimensions of x and F must be equal in the above equation. If this condition is not met, we can use a linear projection W_s via the shortcut connection to match the dimensions: y = F(x, {W_i}) + W_s x.

A square matrix W_s can also be used in the first equation. However, experiments have demonstrated that the identity mapping is sufficient to address the degradation problem and is more economical. Consequently, W_s is only employed when dimension matching is necessary.
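Below is a sketch of a block that uses such a projection shortcut, assuming W_s is realized as a strided 1x1 convolution; the exact downsampling details here are my own simplification, not prescribed by the text above.

```python
import torch.nn as nn

class ProjectionResidualBlock(nn.Module):
    """Residual block whose shortcut applies a projection W_s (a 1x1 convolution)
    so that the shortcut output matches the shape of F(x)."""

    def __init__(self, in_channels, out_channels, stride=2):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, out_channels, kernel_size=3,
                               stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.conv2 = nn.Conv2d(out_channels, out_channels, kernel_size=3,
                               padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_channels)
        self.relu = nn.ReLU(inplace=True)
        # W_s: a strided 1x1 convolution that changes the channel count and
        # spatial size to match F(x); used only because the dimensions differ.
        self.shortcut = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size=1,
                      stride=stride, bias=False),
            nn.BatchNorm2d(out_channels),
        )

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + self.shortcut(x))
```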

Network Architectures

Plain Network & Residual Network

Implementation

We apply batch normalization immediately after each convolution and before the activation. We initialize the weights and train all plain and residual networks from scratch. We use SGD with a mini-batch size of 256; the learning rate starts at 0.1 and is reduced by a factor of 10 when the error plateaus. The models are trained for up to 600,000 iterations. We apply a weight decay of 0.0001 and a momentum of 0.9. Dropout is not used.
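A rough sketch of these settings in PyTorch: the tiny placeholder model and the ReduceLROnPlateau scheduler are my own stand-ins (the paper reduces the rate when the error plateaus, not necessarily with this scheduler), while the optimizer hyperparameters follow the text above.

```python
import torch
import torch.nn as nn

# Placeholder model; in practice this would be a full plain or residual network,
# with batch normalization after each convolution and before the activation.
model = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, padding=1, bias=False),
    nn.BatchNorm2d(64),
    nn.ReLU(inplace=True),
)

# SGD with a mini-batch size of 256 (set in the data loader), learning rate
# starting at 0.1, momentum 0.9, and weight decay 1e-4; no dropout is used.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=1e-4)

# Divide the learning rate by 10 when the monitored error stops improving.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode="min",
                                                       factor=0.1, patience=5)

# Inside the training loop, the scheduler is stepped with the current error:
#   scheduler.step(validation_error)
```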

Experiments

ImageNet Classification

CIFAR-10 and analysis

Analysis of Layer Responses: Figure 7 illustrates that ResNets typically exhibit smaller responses (measured as the standard deviation of each 3x3 convolution’s output, after batch normalization and before the nonlinearity) than their plain counterparts. These findings support our fundamental premise that residual functions are generally closer to zero than non-residual functions. Additionally, we observe that deeper ResNets display smaller response magnitudes, as seen in the comparisons among ResNet-20, ResNet-56, and ResNet-110. With more layers, each individual layer in a ResNet tends to alter the signal to a lesser extent.
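One way to reproduce this kind of response analysis is to hook the batch-normalized convolution outputs and record their standard deviations on a batch of inputs. The sketch below is my own, not the paper’s tooling, and assumes responses are read at the BatchNorm outputs (after BN, before the nonlinearity).

```python
import torch
import torch.nn as nn

def layer_response_stds(model, batch):
    """Return the standard deviation of each BatchNorm2d output for one forward pass,
    i.e. the response of each conv layer after BN and before the nonlinearity."""
    stds, hooks = [], []
    for module in model.modules():
        if isinstance(module, nn.BatchNorm2d):
            hooks.append(module.register_forward_hook(
                lambda m, inputs, output: stds.append(output.std().item())))
    model.eval()
    with torch.no_grad():
        model(batch)
    for h in hooks:
        h.remove()
    return stds

# Example: compare response magnitudes of two models on the same random batch.
# stds = layer_response_stds(model, torch.randn(8, 3, 32, 32))
```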
