Semantic Segmentation Reading List

Semantic segmentation is the task of classifying each pixel of an image into one of a given set of categories.
Classification needs a large field of view, so many networks use pooling layers and striding to shrink the feature maps and gather global information. However, this discards details of the image, and those details matter a great deal for semantic segmentation. Researchers have therefore tried different ways to obtain global features while keeping local details.
Although we try to keep the local details, the segmentation output of a DNN is still very smooth. Three main approaches have been used to recover accurate boundaries. The first harnesses information from multiple layers of the convolutional network to better estimate object boundaries. The second employs a super-pixel representation, essentially delegating the localization task to a low-level segmentation method. The third couples the recognition capacity of DCNNs with the fine-grained localization accuracy of fully connected CRFs.
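
As a minimal illustration of the per-pixel classification described above, the sketch below turns per-class score maps into a label map by taking the highest-scoring class at every pixel. The function name label_map and the toy scores are made up for this example; in practice the scores would come from a network.

```python
def label_map(scores):
    """Turn per-class score maps into a per-pixel label map.

    scores[c][y][x] is the score of class c at pixel (y, x).
    """
    num_classes = len(scores)
    height, width = len(scores[0]), len(scores[0][0])
    return [
        [max(range(num_classes), key=lambda c: scores[c][y][x])
         for x in range(width)]
        for y in range(height)
    ]


# Hypothetical scores for 2 classes on a 2x3 image.
scores = [
    [[0.9, 0.2, 0.1], [0.8, 0.6, 0.3]],  # class 0
    [[0.1, 0.8, 0.9], [0.2, 0.4, 0.7]],  # class 1
]
labels = label_map(scores)
print(labels)
```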

Based on Hypercolumns

  • Hypercolumns for Object Segmentation and Fine-grained Localization. (2014) ☻
    This paper introduces the Hypercolumns method and formulates it as a neural network.
  • PixelNet: Representation of the pixels, by the pixels, and for the pixels. (2016)
    This paper builds on the paper above. There is another version of it:
    PixelNet: Towards a General Pixel-Level Architecture.
    Another paper builds on this work, using the same network for a different task:
    Marr Revisited: 2D-3D Alignment via Surface Normal Prediction.
  • SuperPixels: Feedforward semantic segmentation with zoom-out features. (2015)
    This paper extends the Hypercolumns method to super-pixels by combining several neighboring pixels and building a hierarchical structure.
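
The hypercolumn idea in this section can be sketched as follows: feature maps from several layers are brought back to the input resolution and stacked, so that each pixel gets one vector mixing fine and coarse information. This is only a rough sketch under simplifying assumptions: each layer is a single-channel map, the strides are integers, and nearest-neighbour upsampling stands in for whatever interpolation a real implementation uses. The helper names are hypothetical.

```python
def upsample_nearest(fmap, factor):
    """Nearest-neighbour upsampling of a 2-D map by an integer factor."""
    return [
        [fmap[y // factor][x // factor] for x in range(len(fmap[0]) * factor)]
        for y in range(len(fmap) * factor)
    ]


def hypercolumns(layers):
    """Stack activations from several layers at every pixel.

    layers is a list of (feature_map, stride) pairs; the result hc[y][x]
    is the list of activations, one per layer, at pixel (y, x).
    """
    upsampled = [upsample_nearest(fmap, stride) for fmap, stride in layers]
    height, width = len(upsampled[0]), len(upsampled[0][0])
    return [
        [[u[y][x] for u in upsampled] for x in range(width)]
        for y in range(height)
    ]


# Toy feature maps for a 4x4 image (values are made up).
fine = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
mid = [[10, 20], [30, 40]]  # stride 2
coarse = [[100]]            # stride 4
hc = hypercolumns([(fine, 1), (mid, 2), (coarse, 4)])
print(hc[1][2])
```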

Based on Dilated Conv

  • Dilated-Net: Multi-scale context aggregation by dilated convolutions. (2016) ☻
    This paper uses dilated convolutions to aggregate multi-scale context while still producing dense, pixel-level predictions.
  • DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs. ☻
    This paper uses a conditional random field (CRF) for inference. It also uses dilated convolution (atrous convolution) so that several pooling layers can be removed, and Atrous Spatial Pyramid Pooling (ASPP) for multi-scale image representation. The following paper (and its supplementary material) is useful for understanding CRFs:
    Efficient inference in fully connected crfs with gaussian edge potentials. (2012)
  • DeepLab v3: Rethinking Atrous Convolution for Semantic Image Segmentation. (2017)
  • Learning Dense Convolutional Embeddings for Semantic Segmentation. (2016)
    This paper combines DeepLab with an embedding-based segmentation mask that aims to sharpen boundaries.
  • PSPNet: Pyramid Scene Parsing Network. (2017) ☻
    This paper builds on Dilated-Net and also uses deconvolution. It also introduces a Pyramid Pooling Module.
  • ICNet: ICNet for Real-Time Semantic Segmentation on High-Resolution Images. (2017)
    This paper builds on PSPNet and uses several techniques to speed it up.
  • SegAware: Segmentation-Aware Convolutional Networks Using Local Attention Masks. (2017) ☻
    This paper modifies DeepLab by proposing segmentation-aware convolutional networks to address the smoothness issue. It introduces a pairwise loss, segmentation masks, segmentation-aware smoothing, segmentation-aware CRFs, and segmentation-aware convolution.
  • DUC: Understanding Convolution for Semantic Segmentation. (2017)
    For decoding, this paper introduces a DUC (dense upsampling convolution) element that is very easy to implement and can achieve pixel-level accuracy (similar to ESPCN in super-resolution). For encoding, it replaces dilated convolution with an HDC (hybrid dilated convolution) element, which uses different dilation rates across layers instead of the same rate.
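
To make the idea behind this section concrete, here is a small 1-D sketch of dilated (atrous) convolution: the kernel taps are spread apart by the dilation rate, so the receptive field grows without any pooling or extra parameters. The function names and the dilation rates below are illustrative assumptions, not taken from any of the papers listed.

```python
def dilated_conv1d(signal, kernel, dilation=1):
    """'Valid' 1-D convolution (cross-correlation, as in CNNs) with dilation."""
    span = (len(kernel) - 1) * dilation + 1  # input width seen by one output
    return [
        sum(k * signal[i + j * dilation] for j, k in enumerate(kernel))
        for i in range(len(signal) - span + 1)
    ]


def receptive_field(kernel_size, dilations):
    """Receptive field of a stack of stride-1 dilated convolutions."""
    field = 1
    for d in dilations:
        field += (kernel_size - 1) * d
    return field


signal = [1, 2, 3, 4, 5, 6, 7, 8, 9]
kernel = [1, 1, 1]
dense = dilated_conv1d(signal, kernel, dilation=1)    # taps at i, i+1, i+2
dilated = dilated_conv1d(signal, kernel, dilation=2)  # taps at i, i+2, i+4

# Three 3-tap layers: same rate everywhere vs. growing rates.
print(receptive_field(3, [1, 1, 1]), receptive_field(3, [1, 2, 4]))
```

With the same number of layers and weights, the stack with growing rates covers 15 input samples instead of 7, which is why dilation is attractive when pooling would otherwise be needed to enlarge the field of view.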

Based on Deconv

  • U-Net: Convolutional Networks for Biomedical Image Segmentation.
  • FCN: Fully Convolutional Networks for Semantic Segmentation. (2016) ☻
    This paper compares three upsampling methods (shift-and-stitch, dilated convolution, and deconvolution). It also discusses two training methods (patchwise training and fully convolutional training). The network is difficult to train, so the paper introduces a multi-stage training approach. The following paper is helpful for understanding the shift-and-stitch method:
    Fast image scanning with deep max-pooling convolutional neural networks. (2013)
    For deconvolution, please refer to these two papers:
    Adaptive deconvolutional networks for mid and high level feature learning. (2011)
    Visualizing and understanding convolutional networks. (2014)
  • DeconvNet: Learning Deconvolution Network for Semantic Segmentation. (2015)
    The proposed network is very similar to U-Net. It has a convolution network (modified VGG-16) and a deconvolution network. It also uses unpooling layers and analyzes how their function differs from that of deconvolution layers. This paper uses a two-stage training method, which trains the network on easy examples first and then fine-tunes it on hard examples.
  • DecoupledNet: Decoupled Deep Neural Network for Semi-supervised Semantic Segmentation. (2015) ☻
    This paper uses a semi-supervised learning approach. The algorithm decouples classification and segmentation: the labels associated with an image are identified by a classification network, and binary segmentation is then performed for each identified label by a segmentation network. This effectively reduces the search space for segmentation by exploiting class-specific activation maps obtained from bridging layers.
  • SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation.
    SegNet has a network structure similar to U-Net and DeconvNet, but it uses a different upsampling method. In the encoder network, the location of the maximum feature value in each pooling window is memorized for each encoder feature map. The decoder network upsamples its input feature map(s) using the memorized max-pooling indices from the corresponding encoder feature map(s). These feature maps are then convolved with a trainable decoder filter bank to produce dense feature maps.
  • ENet: A Deep Neural Network Architecture for Real-Time Semantic Segmentation. (2016)
    This paper creates a deep neural network named ENet specifically for tasks requiring low-latency operation. It analyzes several ways to speed up the network, and this analysis is useful for rethinking network structure. For example, it shows that the decoder does not need as many layers as the encoder.
  • ERFNet: Efficient ConvNet for Real-time Semantic Segmentation. (2017)
    This paper introduces a real-time semantic segmentation network and uses speed-up techniques similar to those of ENet.
  • RefineNet: Multi-Path Refinement Networks for High-Resolution Semantic Segmentation. (2016)
    This paper uses multiple paths to connect low-level and high-level features.
  • FRRN: Full-Resolution Residual Networks for Semantic Segmentation in Street Scenes. (2016)
    This paper combines multi-scale context with pixel-level accuracy by using two processing streams within the network. One stream carries information at the full image resolution, enabling precise adherence to segment boundaries. The other stream undergoes a sequence of pooling operations to obtain robust features for recognition. The two streams are coupled at the full image resolution using residuals.
  • LinkNet: Exploiting Encoder Representations for Efficient Semantic Segmentation. (2017)
    This paper designs a small and fast network by linking each encoder to its corresponding decoder.
  • GCN: Large Kernel Matters: Improve Semantic Segmentation by Global Convolutional Network. (2017)
    This paper proposes a Global Convolutional Network and a Boundary Refinement block to address both the classification and the localization issues in semantic segmentation.
  • Tiramisu: The One Hundred Layers Tiramisu: Fully Convolutional DenseNets for Semantic Segmentation. (2017)
    This paper extends DenseNets to semantic segmentation and reports state-of-the-art results.
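
The index-based upsampling described in the SegNet entry above can be sketched in a few lines: the encoder remembers where each maximum came from, and the decoder puts the pooled values back at exactly those positions, leaving zeros elsewhere. This is a toy version for a single-channel map whose sides are divisible by the window size; the function names are hypothetical, and the trainable decoder convolution that follows in SegNet is left out.

```python
def max_pool_with_indices(fmap, size=2):
    """Non-overlapping max pooling that also records each maximum's position."""
    pooled, indices = [], []
    for y in range(0, len(fmap), size):
        pooled_row, index_row = [], []
        for x in range(0, len(fmap[0]), size):
            window = [
                (fmap[y + dy][x + dx], (y + dy, x + dx))
                for dy in range(size)
                for dx in range(size)
            ]
            value, position = max(window, key=lambda item: item[0])
            pooled_row.append(value)
            index_row.append(position)
        pooled.append(pooled_row)
        indices.append(index_row)
    return pooled, indices


def max_unpool(pooled, indices, height, width):
    """Place each pooled value back at its remembered position; zeros elsewhere."""
    out = [[0] * width for _ in range(height)]
    for pooled_row, index_row in zip(pooled, indices):
        for value, (y, x) in zip(pooled_row, index_row):
            out[y][x] = value
    return out


fmap = [
    [1, 3, 2, 0],
    [4, 2, 1, 5],
    [0, 1, 7, 2],
    [6, 2, 3, 1],
]
pooled, indices = max_pool_with_indices(fmap)
sparse = max_unpool(pooled, indices, 4, 4)
for row in sparse:
    print(row)
```

The unpooled map is sparse, which is why the decoder then convolves it with trainable filters to produce dense feature maps.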

Include RNN network

  • DAG-RNNs: DAG-Recurrent Neural Networks For Scene Labeling. (2015)
    In this paper, recurrent neural networks (RNNs) are introduced to equip local features with a broader view of contextual awareness by modeling the contextual dependencies of local features. The authors adopt undirected cyclic graphs (UCGs) to model the interactions among image units and decompose each UCG into several directed acyclic graphs (DAGs).

Weakly-supervised learning approach

  • Constrained convolutional neural networks for weakly supervised segmentation
  • Fully convolutional multi-class multiple instance learning.
  • From image-level to pixel-level labeling with convolutional networks.
  • What's the Point: Semantic Segmentation with Point Supervision. (2016)
    The first contribution of this paper is a novel supervision regime for semantic segmentation based on humans pointing to objects. The second contribution is incorporating a generic objectness prior directly in the loss to guide the training of a CNN.

Semi-supervised learning approach

  • DecoupledNet: Decoupled Deep Neural Network for Semi-supervised Semantic Segmentation. (2015) ☻
  • DIC: Deep Dual Learning for Semantic Image Segmentation. (2016) ☻
  • Weakly-and semi-supervised learning of a dcnn for semantic image segmentation
  • Scribblesup: Scribble-supervised convolutional networks for semantic segmentation. (2016)
  • Boxsup: Exploiting bounding boxes to supervise convolutional networks for semantic segmentation.

Use CRF/MRF to Refine Prediction

  • Efficient inference in fully connected crfs with gaussian edge potentials. (2012)
  • DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs. ☻
  • BNF: Semantic Segmentation with Boundary Neural Fields. (2016)
    This paper performs semantic segmentation with the help of boundary information and uses boundary neural fields for inference.
  • Efficient Piecewise Training of Deep Structured Models for Semantic Segmentation. (2016) ☻
    This paper shows how to improve semantic segmentation through the use of contextual information; specifically, it explores ‘patch-patch’ context between image regions and ‘patch-background’ context. It introduces three networks: FeatMap-Net, Unary-Net and Pairwise-Net. This paper is worth reading carefully.
  • DPN: Semantic Image Segmentation via Deep Parsing Network. (2015) ☻ ☻
    This paper incorporates an MRF into a single CNN and gives a detailed explanation of how. It also shows that many existing deep models with CRFs/MRFs are special cases of this work. I strongly recommend reading this paper. There is an extended version of it:
    Deep Learning Markov Random Field for Semantic Segmentation. (2017)
  • SegAware: Segmentation-Aware Convolutional Networks Using Local Attention Masks. (2017) ☻
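
For readers new to fully connected CRFs, the following sketch shows the flavour of mean-field refinement on a tiny 1-D "image". It is a brute-force illustration under several assumptions of mine: a single Gaussian kernel over position and intensity, a Potts label compatibility, and an explicit O(N^2) loop over all pixel pairs. The efficient filtering that makes this practical on real images is not shown, and the parameter values are arbitrary.

```python
import math


def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mean_field(unary, positions, intensities, weight, theta_pos, theta_int, iters):
    """unary[i][l] is the energy (negative log-probability) of label l at pixel i."""
    n, num_labels = len(unary), len(unary[0])
    # Gaussian edge weight between every pair of distinct pixels.
    kernel = [
        [0.0 if i == j else math.exp(
            -(positions[i] - positions[j]) ** 2 / (2 * theta_pos ** 2)
            - (intensities[i] - intensities[j]) ** 2 / (2 * theta_int ** 2))
         for j in range(n)]
        for i in range(n)
    ]
    q = [softmax([-u for u in row]) for row in unary]
    for _ in range(iters):
        new_q = []
        for i in range(n):
            # Message passing: kernel-weighted sum of the other pixels' beliefs.
            msg = [sum(kernel[i][j] * q[j][l] for j in range(n))
                   for l in range(num_labels)]
            total = sum(msg)
            # Potts model: label l is penalised by the mass on all other labels.
            energy = [unary[i][l] + weight * (total - msg[l])
                      for l in range(num_labels)]
            new_q.append(softmax([-e for e in energy]))
        q = new_q
    return q


# Six pixels: dark on the left, bright on the right. The classifier is
# (hypothetically) wrong at pixel 2, which looks dark but prefers label 1.
prob_label0 = [0.9, 0.9, 0.4, 0.1, 0.1, 0.1]
unary = [[-math.log(p), -math.log(1.0 - p)] for p in prob_label0]
positions = [0, 1, 2, 3, 4, 5]
intensities = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]

q = mean_field(unary, positions, intensities,
               weight=1.0, theta_pos=3.0, theta_int=0.2, iters=5)
before = [row.index(min(row)) for row in unary]
after = [row.index(max(row)) for row in q]
print(before, after)
```

In this toy case the pixel that disagrees with its similar-looking neighbours is pulled to their label, while the intensity edge between pixels 2 and 3 keeps the two regions from bleeding into each other.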


  • OBG-FCN: Object Boundary Guided Semantic Segmentation. (2016)
    This paper uses boundary information to guide semantic segmentation. It consists of three parts: an object boundary prediction FCN (OBP-FCN), which gives accurate prior knowledge of object locations and shape details; a purpose-designed masking architecture (OBG-Mask); and a segment prediction net (FCN).
  • Recalling Holistic Information for Semantic Segmentation. (2016)
  • LRN: Label Refinement Network for Coarse-to-Fine Semantic Segmentation. (2017)
    This paper uses an encoder-decoder framework. The encoder network of LRN is similar to that of SegNet, which is based on the VGG16 network; the decoder network introduces supervision and makes predictions in a coarse-to-fine fashion at several stages.
  • LabelBank: Revisiting Global Perspectives for Semantic Segmentation. (2017)
    This paper is based on the above paper and uses holistic information to filter noisy low-level semantic segmentation by adding a holistic filter. In detail, it is composed of three components. First, they have a holistic inference process that takes varied information sources to reason about the LabelBank representation of an image. Second, they have a detailed semantic segmentation process to conduct preliminary semantic segmentation on the image to generate a segmentation map. Finally, they have a holistic filtering process that leverages the inferred LabelBank to filter out false-positive pixel predictions in the preliminary semantic segmentation results.
  • Semantic Image Segmentation with Task-Specific Edge Detection Using CNNs and a Discriminatively Trained Domain Transform. (2016) ☻
    This paper includes three components. The first, which produces coarse semantic segmentation score predictions, is based on the DeepLab model. The second, called EdgeNet, predicts edges by exploiting features from intermediate layers of DeepLab. The third is the domain transform (DT), an edge-preserving filter that lends itself to very efficient implementation by separable 1-D recursive filtering across rows and columns. The DT filters the raw CNN semantic segmentation scores so that they are better aligned with object boundaries, guided by the edge map produced by EdgeNet.
  • BNF: Semantic Segmentation with Boundary Neural Fields. (2016)
    This paper performs semantic segmentation with the help of boundary information and uses boundary neural fields for inference.
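
The 1-D recursive filtering behind the domain transform mentioned above can be sketched roughly as follows. Each output mixes the current score with the previous output, and the mixing weight collapses towards zero wherever the edge map is strong, so scores are smoothed inside regions but not across boundaries. The weight formula and the parameter values reflect my understanding of the standard recursive domain-transform filter and should be treated as assumptions; only one forward and one backward pass along a single row are shown.

```python
import math


def domain_transform_1d(scores, edges, sigma_s=10.0, sigma_r=0.1):
    """One left-to-right and one right-to-left recursive filtering pass.

    scores[i] is a raw class score at pixel i; edges[i] is the edge strength
    between pixel i-1 and pixel i (edges[0] is unused).
    """
    # Weights are near 1 in flat regions and near 0 across strong edges.
    weights = [
        math.exp(-math.sqrt(2.0) * (1.0 + g * sigma_s / sigma_r) / sigma_s)
        for g in edges
    ]
    out = list(scores)
    for i in range(1, len(out)):           # left-to-right pass
        out[i] = (1.0 - weights[i]) * out[i] + weights[i] * out[i - 1]
    for i in range(len(out) - 2, -1, -1):  # right-to-left pass
        out[i] = (1.0 - weights[i + 1]) * out[i] + weights[i + 1] * out[i + 1]
    return out


scores = [1.0, 1.0, 1.0, 0.0, 0.0, 0.0]
no_edges = [0.0] * 6
one_edge = [0.0, 0.0, 0.0, 1.0, 0.0, 0.0]  # boundary between pixels 2 and 3

smeared = domain_transform_1d(scores, no_edges)
sharp = domain_transform_1d(scores, one_edge)
print([round(v, 3) for v in smeared])
print([round(v, 3) for v in sharp])
```

Without an edge map the step in the scores is blurred; with the edge in place the step stays sharp, which is the behaviour the EdgeNet-guided filtering relies on.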


  • BoxSup: Exploiting Bounding Boxes to Supervise Convolutional Networks for Semantic Segmentation. (2015) ☻
    This paper investigates bounding box annotations as an alternative or extra source of supervision for training convolutional networks for semantic segmentation. Candidate segments are used to update the deep convolutional network, and the semantic features learned by the network are then used to pick better candidates. This procedure is iterated.
  • ParseNet: Looking Wider to See Better. (2015)
    This paper shows that although features from the top layers of a network theoretically have very large receptive fields, in practice the empirical receptive fields are much smaller and not large enough to capture global context. The paper addresses this by applying global average pooling, upsampling the result, and combining it with local features. It also shows that L2 normalization is important for achieving good performance.
  • Not All Pixels Are Equal: Difficulty-Aware Semantic Segmentation via Deep Layer Cascade. (2017) ☻
    This paper proposes a novel deep layer cascade (LC) method to improve the accuracy and speed of semantic segmentation. LC classifies most of the easy regions in the shallow stages and lets the deeper stages focus on the few hard regions.
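
The global-context recipe described in the ParseNet entry above (global average pooling, broadcasting the pooled vector back to every pixel, L2 normalization, then combining with local features) can be sketched as below. This is a simplified reading: features are plain nested lists, the combination is a concatenation, and any learned scaling after normalization is omitted. The function names are hypothetical.

```python
import math


def l2_normalize(vec, eps=1e-12):
    norm = math.sqrt(sum(v * v for v in vec)) + eps
    return [v / norm for v in vec]


def add_global_context(features):
    """features[c][y][x] -> fused[y][x] = normalized local + normalized global."""
    channels = len(features)
    height, width = len(features[0]), len(features[0][0])
    # Global average pooling: one value per channel.
    global_vec = [
        sum(map(sum, features[c])) / (height * width) for c in range(channels)
    ]
    global_vec = l2_normalize(global_vec)
    fused = []
    for y in range(height):
        row = []
        for x in range(width):
            local_vec = l2_normalize([features[c][y][x] for c in range(channels)])
            # "Unpooling" here just repeats the global vector at every pixel.
            row.append(local_vec + global_vec)
        fused.append(row)
    return fused


# Toy 2-channel feature map on a 2x2 grid (values are made up).
features = [
    [[3.0, 0.0], [0.0, 1.0]],  # channel 0
    [[4.0, 0.0], [0.0, 1.0]],  # channel 1
]
fused = add_global_context(features)
print([round(v, 3) for v in fused[0][0]])
```

Normalizing both parts before concatenation keeps the local and global features on a comparable scale, which is the reason the entry above stresses L2 normalization.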