Multi-scale context aggregation by dilated convolutions

Dilated (atrous) convolution

1. Introduction

Dense prediction

The goal is to compute a discrete or continuous label for each pixel in the image. A prominent example is semantic segmentation, which calls for classifying each pixel into one of a given set of categories.

Semantic segmentation is challenging because it requires combining pixel-level accuracy with multi-scale contextual reasoning.

Pixel-level accuracy has to be combined with contextual reasoning.

Modern image classification networks integrate multi-scale contextual information via successive pooling and subsampling layers that reduce resolution until a global prediction is obtained.

Dense prediction calls for multi-scale contextual reasoning in combination with full-resolution output.

The paper develops a network module that aggregates multi-scale contextual information without losing resolution or analyzing rescaled images. The module can be plugged into existing architectures at any resolution.

The module is designed specifically for dense prediction.

It is a rectangular prism of convolution layers, with no pooling or subsampling.

The module is based on dilated convolutions, which support exponential expansion of the receptive field without loss of resolution or coverage.
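
To make the "exponential expansion" concrete, here is a small sketch that computes the receptive field of a stack of stride-1 convolutions. It assumes 3×3 kernels and dilation factors that double at each layer (1, 2, 4, ...); `receptive_field` is a helper name made up for this note.

```python
def receptive_field(kernel_size, dilations):
    """Receptive field width (per axis) of a stack of stride-1 convolutions."""
    rf = 1
    for d in dilations:
        # A kernel with k taps and dilation d spans (k - 1) * d + 1 inputs,
        # so each layer widens the field by (k - 1) * d.
        rf += (kernel_size - 1) * d
    return rf


# Dilation doubles at each layer: 1, 2, 4, 8, 16.
dilated = [receptive_field(3, [2 ** i for i in range(n)]) for n in range(1, 6)]
# Same depth with ordinary (dilation 1) convolutions.
plain = [receptive_field(3, [1] * n) for n in range(1, 6)]

print(dilated)  # [3, 7, 15, 31, 63]
print(plain)    # [3, 5, 7, 9, 11]
```

With doubling dilation the width after n layers is 2^(n+1) - 1, while the number of parameters per layer stays the same; without dilation it grows only linearly.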

2. Dilated Convolution
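
For reference, the operator can be written as below. The notation is an assumption of this note: F is a discrete function on the integer grid, k is a discrete filter, and l is the dilation factor. The first formula is the ordinary discrete convolution, the second is its l-dilated version; setting l = 1 recovers the first.

```latex
(F * k)(\mathbf{p}) = \sum_{\mathbf{s} + \mathbf{t} = \mathbf{p}} F(\mathbf{s})\, k(\mathbf{t})
\qquad
(F *_{l} k)(\mathbf{p}) = \sum_{\mathbf{s} + l\,\mathbf{t} = \mathbf{p}} F(\mathbf{s})\, k(\mathbf{t})
```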

3. Multi-scale context aggregation

The module takes C feature maps as input and produces C feature maps as output. The input and output have the same form, so the module can be plugged into existing dense prediction architectures.

The authors found that random initialization schemes were not effective for the context module. They found an alternative initialization with clear semantics to be much more effective.

This initialization sets all filters such that each layer simply passes the input directly to the next.
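
A minimal NumPy sketch of such an identity initialization: the centre tap of each kernel is 1 where the output channel equals the input channel, and every other weight is 0. The zero padding, the dilation schedule, and the helper names `identity_kernel` and `dilated_conv2d` are assumptions made for illustration, and the nonlinearity between layers is left out.

```python
import numpy as np


def identity_kernel(channels, size=3):
    """Weights of shape (out, in, size, size): centre tap is 1 where out == in."""
    w = np.zeros((channels, channels, size, size))
    c = size // 2
    for b in range(channels):
        w[b, b, c, c] = 1.0
    return w


def dilated_conv2d(x, w, dilation):
    """Stride-1 dilated cross-correlation with zero padding; x is (C, H, W)."""
    c_out, _, k, _ = w.shape
    pad = dilation * (k // 2)
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    _, h, wd = x.shape
    out = np.zeros((c_out, h, wd))
    for i in range(k):
        for j in range(k):
            # Window of the padded input seen by kernel tap (i, j).
            patch = xp[:, i * dilation:i * dilation + h,
                       j * dilation:j * dilation + wd]
            out += np.einsum("oc,chw->ohw", w[:, :, i, j], patch)
    return out


x = np.random.default_rng(0).normal(size=(4, 16, 16))
y = x
for d in [1, 1, 2, 4, 8]:  # illustrative dilation schedule
    y = dilated_conv2d(y, identity_kernel(4), d)

print(np.allclose(x, y))  # True
```

Every layer starts out as a pass-through, so the module initially leaves the input feature maps unchanged and training only has to learn corrections on top of that.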

FCN: fully convolutional network

It predicts the class of every pixel in an image.

All fully connected layers are replaced with convolution layers.
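
A small NumPy sketch of why this replacement works: a fully connected layer applied to a fixed-size C×H×W input is the same computation as a convolution whose kernel covers the whole H×W window. All sizes and the helper `conv_valid` are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
C, H, W, K = 3, 4, 4, 5  # channels, height, width, number of classes
fc_weight = rng.normal(size=(K, C * H * W))

# Reshape each row of the fully connected weight matrix into a C x H x W kernel.
conv_weight = fc_weight.reshape(K, C, H, W)


def conv_valid(x, w):
    """Stride-1 'valid' cross-correlation; x is (C, H, W), w is (K, C, kh, kw)."""
    k, _, kh, kw = w.shape
    out_h, out_w = x.shape[1] - kh + 1, x.shape[2] - kw + 1
    out = np.zeros((k, out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[:, i, j] = np.tensordot(w, x[:, i:i + kh, j:j + kw], axes=3)
    return out


x = rng.normal(size=(C, H, W))
fc_out = fc_weight @ x.reshape(-1)
conv_out = conv_valid(x, conv_weight)
print(conv_out.shape, np.allclose(fc_out, conv_out[:, 0, 0]))  # (5, 1, 1) True

# On a larger input the same weights produce a spatial map of class scores.
big = rng.normal(size=(C, 6, 7))
print(conv_valid(big, conv_weight).shape)  # (5, 3, 4)
```

The converted network no longer needs a fixed input size; a larger image simply yields a coarse grid of predictions instead of a single one.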

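Convolutionalization: a fully convolutional network keeps reducing the resolution, so deconvolution is needed to raise it again.
Deconvolution:
- Zero padding around the border: zeros are inserted around the low-resolution image, which is then convolved with a kernel to output a higher-resolution image
- Zeros inserted between the pixels (interior)
The resulting segmentation is fairly coarse, so a skip-layer structure is used to fuse it with the outputs of the two preceding convolution layers.
Skip-layer structure?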
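
The two zero-insertion variants of deconvolution listed above can be sketched in NumPy as follows. The 3×3 all-ones kernel, the stride of 2, and the helper names are assumptions chosen for illustration; in a real network the kernel would be learned.

```python
import numpy as np


def conv_valid2d(x, k):
    """Stride-1 'valid' cross-correlation of a 2-D array with a 2-D kernel."""
    kh, kw = k.shape
    out = np.zeros((x.shape[0] - kh + 1, x.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out


def pad_border(x, kernel_size):
    """Variant 1: surround the input with kernel_size - 1 zeros on every side."""
    return np.pad(x, kernel_size - 1)


def insert_zeros(x, stride):
    """Variant 2: place stride - 1 zeros between neighbouring pixels."""
    h, w = x.shape
    out = np.zeros(((h - 1) * stride + 1, (w - 1) * stride + 1))
    out[::stride, ::stride] = x
    return out


x = np.arange(4.0).reshape(2, 2)
k = np.ones((3, 3))

up1 = conv_valid2d(pad_border(x, 3), k)                   # border zeros only
up2 = conv_valid2d(pad_border(insert_zeros(x, 2), 3), k)  # interior zeros too
print(up1.shape)  # (4, 4)
print(up2.shape)  # (5, 5)
```

Border padding alone grows the output by only kernel_size - 1 pixels per axis, whereas interior zeros scale the output roughly by the stride.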

Semantic segmentation

Recognizes the image at the pixel level: each pixel is labeled with the object class it belongs to.

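Patch classification: each pixel is classified independently using the image patch around it. The main reason for using patch classification is that classification networks usually end in fully connected layers, which require a fixed input size.
FCN (fully convolutional network)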

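Besides fully connected layers, another major problem with using convolutional neural networks for semantic segmentation is the pooling layers: pooling enlarges the receptive field and aggregates context, but in doing so it discards location information.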

Solutions:

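U-Net, an encoder-decoder structure: the encoder gradually reduces the spatial dimensions through pooling layers, and the decoder gradually recovers object detail and spatial dimensions. There are usually shortcut connections between the encoder and the decoder, which help the decoder recover object detail better.
Dilated convolution (also called atrous convolution, or convolution "with holes")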