Convolutional nets and watershed cuts for real-time semantic labeling of RGBD videos

Camille Couprie, Clément Farabet, Laurent Najman, Yann LeCun

Research output: Contribution to journal › Article

Abstract

This work addresses multi-class segmentation of indoor scenes with RGB-D inputs. While this area of research has gained much attention recently, most works still rely on handcrafted features. In contrast, we apply a multiscale convolutional network to learn features directly from the images and the depth information. Using a frame-by-frame labeling, we obtain nearly state-of-the-art performance on the NYU-v2 depth data set with an accuracy of 64.5%. We then show that the labeling can be further improved by exploiting the temporal consistency in the video sequence of the scene. To that end, we present a method producing temporally consistent superpixels from a streaming video. Among the different methods producing superpixel segmentations of an image, the graph-based approach of Felzenszwalb and Huttenlocher is broadly employed. One of its interesting properties is that the regions are computed in a greedy manner in quasi-linear time by using a minimum spanning tree. In a framework exploiting minimum spanning trees throughout, we propose an efficient video segmentation approach that computes temporally consistent superpixels in a causal manner, filling the need for causal, real-time applications. We illustrate the labeling of indoor scenes in video sequences that could be processed in real time using appropriate hardware such as an FPGA.
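
The greedy minimum-spanning-tree computation mentioned above is the graph-based segmentation of Felzenszwalb and Huttenlocher. As a rough illustration of that building block only (not the authors' implementation, which also uses the depth channel and the learned convolutional features, and extends the idea to video), the Python sketch below segments a single image with a Kruskal-style sweep over a 4-connected grid graph; the merge tolerance k and the colour-distance edge weights are assumptions of this sketch.

import numpy as np


class UnionFind:
    """Disjoint-set forest with union by size and path compression."""

    def __init__(self, n):
        self.parent = np.arange(n)
        self.size = np.ones(n, dtype=np.int64)
        self.internal = np.zeros(n)  # max MST edge weight inside each component

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b, w):
        a, b = self.find(a), self.find(b)
        if self.size[a] < self.size[b]:
            a, b = b, a
        self.parent[b] = a
        self.size[a] += self.size[b]
        self.internal[a] = w  # edges arrive in increasing order of weight


def felzenszwalb_segment(image, k=300.0):
    """Greedy quasi-linear segmentation of an (H, W) or (H, W, C) image."""
    img = np.atleast_3d(image).astype(np.float64)
    h, w, _ = img.shape
    idx = np.arange(h * w).reshape(h, w)

    # Build the 4-connected grid graph; edge weight = colour distance.
    edge_list = []
    for sa, sb in [((slice(None), slice(None, -1)), (slice(None), slice(1, None))),
                   ((slice(None, -1), slice(None)), (slice(1, None), slice(None)))]:
        weight = np.linalg.norm(img[sa] - img[sb], axis=-1)
        edge_list.append(np.stack([idx[sa].ravel(),
                                   idx[sb].ravel(),
                                   weight.ravel()], axis=1))
    edges = np.concatenate(edge_list)
    edges = edges[np.argsort(edges[:, 2])]  # Kruskal-style increasing sweep

    uf = UnionFind(h * w)
    for a, b, weight in edges:
        ra, rb = uf.find(int(a)), uf.find(int(b))
        if ra == rb:
            continue
        # Felzenszwalb-Huttenlocher criterion: merge when the connecting edge
        # is no heavier than either component's internal difference plus a
        # size-dependent tolerance k / |C|.
        if (weight <= uf.internal[ra] + k / uf.size[ra]
                and weight <= uf.internal[rb] + k / uf.size[rb]):
            uf.union(ra, rb, weight)

    return np.array([uf.find(i) for i in range(h * w)]).reshape(h, w)


if __name__ == "__main__":
    toy = np.zeros((40, 40))
    toy[:, 20:] = 1.0  # two flat regions separated by a vertical step edge
    labels = felzenszwalb_segment(toy, k=50.0)
    print("number of superpixels:", len(np.unique(labels)))

In the paper, the same minimum-spanning-tree machinery is additionally tied to the segmentation of the preceding frames so that superpixel labels stay consistent over time; the sketch above covers only the single-image case.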

Original language: English (US)
Article number: A20
Pages (from-to): 3489-3511
Number of pages: 23
Journal: Journal of Machine Learning Research
ISSN: 1532-4435
Volume: 15
State: Published - Jan 1 2015

Keywords

  • Convolutional networks
  • Deep learning
  • Depth information
  • Optimization
  • Superpixels

ASJC Scopus subject areas

  • Artificial Intelligence
  • Software
  • Control and Systems Engineering
  • Statistics and Probability

Cite this

Couprie, Camille; Farabet, Clément; Najman, Laurent; LeCun, Yann. Convolutional nets and watershed cuts for real-time semantic labeling of RGBD videos. In: Journal of Machine Learning Research, Vol. 15, A20, 2015, pp. 3489-3511.