Fully Convolutional Networks for Semantic Segmentation



  1. pixelwise prediction
  2. superwised learning

Dense feedforward computation and backpropagation. Upsampling layers enable pixelwise prediction and learning.

Fully Convolutional Networks

Each layer of data in a convnet has size of $h\times w\times d$.

  • $w$ and $h$: spatial dimensions
  • $d$: feature or channel dimension
  • Receptive Fields: Locations in higher layers correspond to the locations in the image they are path-connected to.

Convnets are built on translation invariance. Their basic components (convolution, pooling, and activation function) operate on local input regions and depend only relative spatial coordinates.

  • $x_{ij}$: data vector at location $(i,j)$ in a particular layer
  • $y_{ij}$: following layer computed by:
  • $k$: kernel size
  • $s$: stride or subsampling factor
  • $f_{ks}$: layer type: a matrix multiplication for convolution or average pooling, a spatial max for max pooling, or an elementwise nonlinearity for an activation function.

Kernel size and stride obeying the transformation rule: [ f_{ks}\circ g_{k^\prime s^\prime} = (f\circ g)_{k^\prime + (k-1)s^\prime, ss^\prime\cdot} ] It computes a nonlinear filter, which is called deep filter or fully convolutional network. Thus FCN operates on an input of any size and produces an output of corresponding spatial dimensions.

Loss function is a sum over the spatial dimentions of the final layer: [ \ell(\mathbf{x};\theta) = \sum_{ij}\ell^\prime (\mathbf{x}_{ij};\theta) ] its gradient will be a sum over the gradients of each of its spatial components. Thus stochastic gradient descent on $\ell$ computed on whole images will be the same as stochastic gradient descent on $\ell^\prime$, taking all of the final layer receptive fields as a minibatch.

When receptive fields overlap significantly, feedforward computation and backprop are much more efficient when computed layer-by-layer over an entire image instead of independently patch-by-patch.

Adapting classifiers for dense prediction

Fully connected layers throw away spatial coordinates. These fully connected layers can also be viewed as convolutions with kernels that cover their entire input regions.

While the resulting maps are equivalent to the evaluation of the original net on particular input patches.


Leave a Comment