Reading Notes: Identity Mappings in Deep Residual Networks

lzhangstat
Dec 31, 2024

The original paper is “Identity Mappings in Deep Residual Networks” by He et al. This reading note focuses specifically on the model architecture.

In this paper, we analyze the propagation formulations behind the residual building blocks, which suggest that the forward and backward signals can be directly propagated from one block to any other block, when using identity mappings as the skip connections and after-addition activation.

Introduction

The central idea of ResNets is to learn the additive residual function F with respect to h(x_l), with a key choice of using an identity mapping h(x_l) = x_l. In this paper, our derivations reveal that if both h(x_l) and f(y_l) are identity mappings, the signal can be directly propagated from one unit to any other unit, in both forward and backward passes. The experiments empirically show that training in general becomes easier when the architecture is closer to these two conditions.

To construct an identity mapping f(y_l) = y_l, we view the activation functions (ReLU and BN) as “pre-activation” of the weight layers, in contrast to the conventional wisdom of “post-activation.” This point of view leads to a new residual unit design in which BN and ReLU precede each weight layer and the addition is no longer followed by any activation.
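
As a concrete sketch of this design (a minimal PyTorch-style illustration, not the paper's code; the class name PreActBlock and the 3×3 convolution sizes are assumptions):

```python
import torch
import torch.nn as nn

class PreActBlock(nn.Module):
    """Pre-activation residual unit: BN -> ReLU -> conv, twice, then an identity addition."""

    def __init__(self, channels: int):
        super().__init__()
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # BN and ReLU act as "pre-activation" of each weight layer.
        out = self.conv1(torch.relu(self.bn1(x)))
        out = self.conv2(torch.relu(self.bn2(out)))
        # Both the skip connection h and the after-addition function f are identities.
        return x + out
```

The only change relative to the original unit is that BN and ReLU move in front of each convolution, so the block's output is a pure identity addition rather than a ReLU applied after the addition.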

Analysis of Deep Residual Networks

The ResNets are modularized architectures that stack building blocks of the same connecting shape. In this paper, we call these blocks “Residual Units.” The original Residual Unit performs the following computations
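
y_l = h(x_l) + F(x_l, W_l)
x_{l+1} = f(y_l)

where x_l is the input to the l-th Residual Unit, W_l is its set of weights, F is the residual function, h(x_l) = x_l is the identity skip connection, and f is the after-addition activation (ReLU in the original ResNet).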

If f is also an identity mapping, we will obtain
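
x_{l+1} = x_l + F(x_l, W_l)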

Recursively, we will have
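
x_L = x_l + Σ_{i=l}^{L-1} F(x_i, W_i)

for any deeper unit L and any shallower unit l.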

The above equation exhibits some nice properties

  1. The feature x_L of any deeper unit L can be represented as the feature x_l of any shallower unit l plus a residual function, indicating that the model is in a residual fashion between any units L and l.
  2. The feature x_L is the summation of the outputs of all preceding residual functions (plus x_0). This is in contrast to a “plain network” where a feature x_L is a series of matrix-vector products (ignoring BN and ReLU).
  3. Nice backward propagation properties. Denoting the loss function as 𝜀, from the chain rule of backpropagation, we have
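
∂𝜀/∂x_l = (∂𝜀/∂x_L) · (∂x_L/∂x_l) = (∂𝜀/∂x_L) · (1 + ∂/∂x_l Σ_{i=l}^{L-1} F(x_i, W_i))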

This indicates that the gradient can be decomposed into two additive terms: one that propagates information directly, without passing through any weight layers, and another that propagates through the weight layers. The first term ensures that information is directly propagated back to any shallower unit l.

On the Importance of Identity Skip Connections

Let’s consider a simple modification, h(x_l) = 𝜆_l x_l, that breaks the identity shortcut, where 𝜆_l is a modulating scalar (the identity shortcut corresponds to 𝜆_l = 1):
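
x_{l+1} = 𝜆_l x_l + F(x_l, W_l)

Unfolding this recursion as before (and absorbing the scalars into the residual functions, written F̂) gives

x_L = (∏_{i=l}^{L-1} 𝜆_i) x_l + Σ_{i=l}^{L-1} F̂(x_i, W_i)

and the gradient becomes

∂𝜀/∂x_l = (∂𝜀/∂x_L) · (∏_{i=l}^{L-1} 𝜆_i + ∂/∂x_l Σ_{i=l}^{L-1} F̂(x_i, W_i))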

In the above derivative formula, the first additive term is modulated by a factor. For an extremely deep network, if 𝜆_i > 1 for all i, this factor can be exponentially large; if 𝜆_i <1 for all i, this factor can be exponentially small and vanish, which blocks the backpropagated signal from the shortcut and forces it to flow through the weight layers.
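
For instance, with 𝜆_i = 0.9 across L - l = 100 units, the factor is 0.9^100 ≈ 3 × 10^-5, so essentially none of the shortcut signal reaches the shallower unit.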

Experiments on Skip Connections

In this section, we conduct experiments to explore what happens when the identity skip connection h(x_l) = x_l is not maintained. Several types of shortcut connections are evaluated (besides the variants below, the paper also tests 1×1 convolutional and dropout shortcuts).

Constant scaling
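
The shortcut is scaled by a constant 𝜆 = 0.5 for every Residual Unit, with the residual function F either left unscaled or scaled by 1 - 𝜆. In the paper's experiments, the unscaled variant does not converge well, and the scaled variant converges to a considerably higher error than the identity baseline.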

Exclusive gating
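
Following the gating mechanism of Highway Networks, a gate g(x) = σ(W_g x + b_g) scales the residual path by g(x) and the shortcut path by 1 - g(x), so the two paths are gated exclusively. The results depend heavily on the initialization of the gate bias b_g and never match the identity baseline.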

Shortcut-only gating
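
Here the residual function F is not scaled; only the shortcut path is gated by 1 - g(x). When the gate bias is initialized so that 1 - g(x) stays close to 1, the shortcut is close to identity and the result approaches the baseline; when the shortcut is heavily suppressed, the error increases sharply.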

It is important to note that gating and 1×1 convolutional shortcuts introduce additional parameters and should therefore have stronger representational ability than identity shortcuts. In fact, they encompass the solution space of identity shortcuts (i.e., they could be optimized to act as identity shortcuts). However, their training errors are higher than those of identity shortcuts, indicating that the degradation of these models is caused by optimization issues rather than by representational limitations.

On the Usage of Activation Functions

Experiments on Activation
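
The paper compares several re-arrangements of BN and ReLU around the addition: the original design, BN after addition, ReLU before addition, ReLU-only pre-activation, and full pre-activation (both BN and ReLU placed before each weight layer). Full pre-activation gives the best results.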

Analysis

We find the impact of pre-activation is twofold. First, optimization is further eased (compared with the baseline ResNet) because f is an identity mapping. Second, using BN as pre-activation improves the regularization of the models: the inputs to all weight layers are normalized, whereas in the original design the signal merged by the addition is not normalized before entering the next weight layer.
