Deep End2End Voxel2Voxel Prediction

Du Tran, Lubomir Bourdev, Robert Fergus, Lorenzo Torresani, Manohar Paluri

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Over the last few years, deep learning methods have emerged as one of the most prominent approaches for video analysis. However, so far their most successful applications have been in video classification and detection, i.e., problems involving the prediction of a single class label or a handful of output variables per video. Furthermore, while deep networks are commonly recognized as the best models for these domains, there is a widespread perception that, in order to yield successful results, they often require time-consuming architecture search, manual parameter tweaking, and computationally intensive preprocessing or post-processing. In this paper we challenge these views by presenting a deep 3D convolutional architecture trained end to end to perform voxel-level prediction, i.e., to output a variable at every voxel of the video. Most importantly, we show that the exact same architecture can be used to achieve competitive results on three widely different voxel-prediction tasks: video semantic segmentation, optical flow estimation, and video coloring. The three networks are trained from raw video without any form of preprocessing, and their outputs require no post-processing to achieve outstanding performance. Thus, they offer an efficient alternative to traditional and much more computationally expensive methods in these video domains.
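As a concrete illustration of the kind of model the abstract describes, here is a minimal PyTorch sketch of a voxel-to-voxel network: a 3D fully-convolutional encoder-decoder that maps a video clip to a prediction at every voxel. This is a sketch of the general idea only, not the authors' exact V2V architecture; the layer counts, channel widths, pooling/upsampling scheme, and the 21-channel output (e.g., one score per semantic-segmentation class) are all illustrative assumptions.

```python
# Minimal sketch of a voxel-to-voxel 3D ConvNet (illustrative assumptions,
# not the paper's exact architecture): a clip goes in, per-voxel scores come out.
import torch
import torch.nn as nn

class Voxel2VoxelNet(nn.Module):
    def __init__(self, in_channels=3, out_channels=21):  # 21 classes: an assumption
        super().__init__()
        # Encoder: 3D convolutions + pooling shrink space-time by 4x.
        self.encoder = nn.Sequential(
            nn.Conv3d(in_channels, 64, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool3d(kernel_size=2, stride=2),      # 1/2 resolution
            nn.Conv3d(64, 128, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool3d(kernel_size=2, stride=2),      # 1/4 resolution
            nn.Conv3d(128, 256, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )
        # Decoder: transposed 3D convolutions restore full voxel resolution,
        # then a 1x1x1 convolution emits one score vector per voxel.
        self.decoder = nn.Sequential(
            nn.ConvTranspose3d(256, 128, kernel_size=2, stride=2),
            nn.ReLU(inplace=True),
            nn.ConvTranspose3d(128, 64, kernel_size=2, stride=2),
            nn.ReLU(inplace=True),
            nn.Conv3d(64, out_channels, kernel_size=1),
        )

    def forward(self, x):
        # x: (batch, channels, frames, height, width)
        return self.decoder(self.encoder(x))

if __name__ == "__main__":
    clip = torch.randn(1, 3, 16, 112, 112)       # one 16-frame RGB clip
    logits = Voxel2VoxelNet()(clip)
    print(logits.shape)                          # torch.Size([1, 21, 16, 112, 112])
```

Trained end to end with a per-voxel loss (e.g., cross-entropy over classes for segmentation, or a regression loss for optical flow), such a network needs no hand-crafted preprocessing or post-processing, which is the property the abstract emphasizes.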

Original language: English (US)
Title of host publication: Proceedings - 29th IEEE Conference on Computer Vision and Pattern Recognition Workshops, CVPRW 2016
Publisher: IEEE Computer Society
Pages: 402-409
Number of pages: 8
ISBN (Electronic): 9781467388504
DOI: https://doi.org/10.1109/CVPRW.2016.57
State: Published - Dec 16, 2016
Event: 29th IEEE Conference on Computer Vision and Pattern Recognition Workshops, CVPRW 2016 - Las Vegas, United States
Duration: Jun 26, 2016 - Jul 1, 2016

Other

Other: 29th IEEE Conference on Computer Vision and Pattern Recognition Workshops, CVPRW 2016
Country: United States
City: Las Vegas
Period: 6/26/16 - 7/1/16

Fingerprint

  • Optical flows
  • Coloring
  • Processing
  • Labels
  • Semantics
  • Deep learning

ASJC Scopus subject areas

  • Computer Vision and Pattern Recognition
  • Electrical and Electronic Engineering

Cite this

Tran, D., Bourdev, L., Fergus, R., Torresani, L., & Paluri, M. (2016). Deep End2End Voxel2Voxel Prediction. In Proceedings - 29th IEEE Conference on Computer Vision and Pattern Recognition Workshops, CVPRW 2016 (pp. 402-409). [7789547] IEEE Computer Society. https://doi.org/10.1109/CVPRW.2016.57

Deep End2End Voxel2Voxel Prediction. / Tran, Du; Bourdev, Lubomir; Fergus, Robert; Torresani, Lorenzo; Paluri, Manohar. Proceedings - 29th IEEE Conference on Computer Vision and Pattern Recognition Workshops, CVPRW 2016. IEEE Computer Society, 2016. p. 402-409. Article 7789547.

Tran, D, Bourdev, L, Fergus, R, Torresani, L & Paluri, M 2016, Deep End2End Voxel2Voxel Prediction. in Proceedings - 29th IEEE Conference on Computer Vision and Pattern Recognition Workshops, CVPRW 2016., 7789547, IEEE Computer Society, pp. 402-409, 29th IEEE Conference on Computer Vision and Pattern Recognition Workshops, CVPRW 2016, Las Vegas, United States, 6/26/16. https://doi.org/10.1109/CVPRW.2016.57
Tran D, Bourdev L, Fergus R, Torresani L, Paluri M. Deep End2End Voxel2Voxel Prediction. In Proceedings - 29th IEEE Conference on Computer Vision and Pattern Recognition Workshops, CVPRW 2016. IEEE Computer Society. 2016. p. 402-409. Article 7789547. https://doi.org/10.1109/CVPRW.2016.57
Tran, Du ; Bourdev, Lubomir ; Fergus, Robert ; Torresani, Lorenzo ; Paluri, Manohar. / Deep End2End Voxel2Voxel Prediction. Proceedings - 29th IEEE Conference on Computer Vision and Pattern Recognition Workshops, CVPRW 2016. IEEE Computer Society, 2016. pp. 402-409
@inproceedings{765ece793c4e4a05ac7b7cd9f4d4da66,
title = "Deep End2End Voxel2Voxel Prediction",
author = "Du Tran and Lubomir Bourdev and Robert Fergus and Lorenzo Torresani and Manohar Paluri",
year = "2016",
month = "12",
day = "16",
doi = "10.1109/CVPRW.2016.57",
language = "English (US)",
pages = "402--409",
booktitle = "Proceedings - 29th IEEE Conference on Computer Vision and Pattern Recognition Workshops, CVPRW 2016",
publisher = "IEEE Computer Society",
address = "United States",

}

TY - GEN

T1 - Deep End2End Voxel2Voxel Prediction

AU - Tran, Du

AU - Bourdev, Lubomir

AU - Fergus, Robert

AU - Torresani, Lorenzo

AU - Paluri, Manohar

PY - 2016/12/16

Y1 - 2016/12/16

UR - http://www.scopus.com/inward/record.url?scp=85010192577&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85010192577&partnerID=8YFLogxK

U2 - 10.1109/CVPRW.2016.57

DO - 10.1109/CVPRW.2016.57

M3 - Conference contribution

AN - SCOPUS:85010192577

SP - 402

EP - 409

BT - Proceedings - 29th IEEE Conference on Computer Vision and Pattern Recognition Workshops, CVPRW 2016

PB - IEEE Computer Society

ER -