Deep Residual Learning for Image Recognition


Q: Is learning better networks as simple as stacking more layers?

A: No.

An obstacle to answering this question was the notorious problem of vanishing/exploding gradients, which hamper convergence from the beginning.

This problem has been largely addressed by normalized initialization and intermediate normalization layers, which enable networks with tens of layers to start converging under SGD with backpropagation.

[Figure: Resnet_1]

When a plain network (a simple stack of layers) becomes very deep, its performance actually degrades: error gets worse on both the training set and the test set.

One cause of this problem is that the deeper the network, the more pronounced the vanishing gradients become, so the network does not train well.

The goal of the ResNet architecture is to overcome the vanishing-gradient problem as the network gets deeper.


Method:

[Figure: Plain_net]

[Figure: Resnet_plain net]

[Figure: Resnet network]

Identity mapping: X * I = X

Constructed solution: 20 layers -> 20 layers + 36 identity layers (56 layers in total)

The existence of this constructed solution indicates that a deeper model should produce no higher training error than its shallower counterpart.
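A minimal sketch of this construction in PyTorch (the framework choice and layer sizes are my own, purely for illustration): appending identity layers to a shallow model yields a deeper model that computes exactly the same function, so its training error cannot be higher.

```python
import torch
import torch.nn as nn

# Hypothetical shallow "plain" network (sizes are illustrative, not from the paper).
shallow = nn.Sequential(
    nn.Linear(32, 64), nn.ReLU(),
    nn.Linear(64, 10),
)

# Constructed deeper counterpart: the same layers followed by extra identity layers.
# nn.Identity() implements the mapping x -> x, so the overall function is unchanged.
deeper = nn.Sequential(
    *shallow,
    *[nn.Identity() for _ in range(36)],
)

x = torch.randn(4, 32)
assert torch.equal(shallow(x), deeper(x))  # identical outputs -> no higher training error
```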

Shortcut connection: H(x) = F(x) + x, skipping one or more layers.

Results:

  • Extremely deep residual nets are easy to optimize, but the counterpart "plain" nets (that simply stack layers) exhibit higher training error when the depth increases.
  • Deep residual nets can easily enjoy accuracy gains from greatly increased depth, producing results substantially better than previous networks.

Residual Network:

H(x): the underlying mapping to be fit by a few stacked layers (not necessarily the entire net), with x denoting the input to the first of these layers.

If one hypothesizes that multiple nonlinear layers can asymptotically approximate complicated functions, then it is equivalent to hypothesize that they can asymptotically approximate the residual functions, i.e., H(x) - x.

Because the direct mapping is hard to learn, the paper proposes a reformulation: instead of learning the mapping from x to H(x) directly, the layers learn the difference between the two, i.e., the residual. To recover H(x), we simply add this residual back to the input.

Letting the residual be F(x) = H(x) - x, the network does not learn H(x) directly, but instead learns F(x) and outputs F(x) + x.
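A minimal PyTorch sketch of a residual block under this formulation (channel counts and layer names are illustrative assumptions, not the paper's exact architecture): the stacked layers learn F(x), and the block outputs F(x) + x.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    """Basic residual block with matching input/output dimensions."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        fx = self.bn2(self.conv2(F.relu(self.bn1(self.conv1(x)))))  # F(x), the residual
        return F.relu(fx + x)  # H(x) = F(x) + x, followed by a nonlinearity

x = torch.randn(1, 64, 56, 56)
y = ResidualBlock(64)(x)  # output shape is preserved: (1, 64, 56, 56)
```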

Identity Mapping by Shortcuts:

When the input and output are of the same dimensions:

y = F(x, {W_i}) + x, where y and x are the output and input vectors of the layers considered.

If not, 1) perform a linear projection W_s via the shortcut connections to match the dimensions: y = F(x, {W_i}) + W_s * x

2) The shortcut still performs identity mapping, with extra zero entries padded for the increased dimensions.

If the input and output dimensions differ, use zero padding or a projection (via a 1x1 convolution) to obtain matching sizes.
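A hedged PyTorch sketch of the dimension-matching case (stride and channel numbers are my own illustration): a strided 1x1 convolution on the shortcut plays the role of the projection W_s when F(x) changes the channel count or spatial size; zero padding is the parameter-free alternative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DownsampleBlock(nn.Module):
    """Residual block whose F(x) changes dimensions; the shortcut projects x with a 1x1 conv."""
    def __init__(self, in_ch, out_ch, stride=2):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_ch)
        # Projection shortcut W_s: a 1x1 conv with the same stride, so shapes match.
        self.proj = nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False)

    def forward(self, x):
        fx = self.bn2(self.conv2(F.relu(self.bn1(self.conv1(x)))))  # F(x, {W_i})
        return F.relu(fx + self.proj(x))                            # F(x) + W_s * x

x = torch.randn(1, 64, 56, 56)
y = DownsampleBlock(64, 128)(x)  # y.shape == (1, 128, 28, 28)
```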