Learning to synthesize 3D indoor scenes from monocular images

Fan Zhu, Fumin Shen, Li Liu, Ling Shao, Jin Xie, Yi Fang

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Depth images have long played a critical role in indoor scene understanding, and are particularly important for tasks that involve 3D inference. However, since depth images are not universally available, removing them from the testing stage can significantly improve a method's generality. In this work, we consider scenarios where depth images are unavailable at test time, and propose to learn a convolutional long short-term memory (Conv LSTM) network and a regression convolutional neural network (regression ConvNet) from monocular RGB images alone. The proposed networks benefit from 2D segmentations, object-level spatial context, object-scene dependencies and objects' geometric information. Optimization is governed by a semantic label loss, which measures the label consistency of both objects and scenes, and a 3D geometric loss, which measures the correctness of objects' 6-DoF estimates. The Conv LSTM and the regression ConvNet are applied to scene/object classification, object detection and 6-DoF estimation, respectively; we exploit joint inference from both networks and further show how fully rigged 3D scenes can be synthesized according to the arrangement of objects in monocular images. Quantitative and qualitative results on the NYU-v2 dataset demonstrate that the proposed Conv LSTM achieves state-of-the-art performance without requiring depth information.
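
To make the composite objective concrete, the following is a minimal PyTorch sketch of a joint loss of the kind the abstract describes: a semantic label term over object and scene predictions plus a 3D geometric term over regressed 6-DoF poses. This is an illustration under stated assumptions, not the authors' implementation; the class name JointLoss, the balancing weight lam, the choice of cross-entropy and smooth-L1 terms, and all tensor shapes are assumed for the example.

    # Hedged sketch of a joint semantic + 3D geometric objective.
    # All design choices below (loss types, weighting, shapes) are assumptions;
    # this record does not specify the paper's exact formulation.
    import torch
    import torch.nn as nn

    class JointLoss(nn.Module):
        """Semantic label loss (objects + scene) plus a geometric loss on 6-DoF poses."""
        def __init__(self, lam=1.0):
            super().__init__()
            self.label_loss = nn.CrossEntropyLoss()  # label-consistency term
            self.pose_loss = nn.SmoothL1Loss()       # 6-DoF regression term (assumed)
            self.lam = lam                           # assumed balancing weight

        def forward(self, obj_logits, obj_labels, scene_logits, scene_labels,
                    pose_pred, pose_gt):
            # Label consistency of both objects and the scene, as in the abstract.
            semantic = (self.label_loss(obj_logits, obj_labels)
                        + self.label_loss(scene_logits, scene_labels))
            # Geometric correctness of per-object 6-DoF estimates (6 values each).
            geometric = self.pose_loss(pose_pred, pose_gt)
            return semantic + self.lam * geometric

    # Toy usage: 4 detected objects, 10 object classes, 5 scene classes.
    loss_fn = JointLoss(lam=0.5)
    loss = loss_fn(torch.randn(4, 10), torch.randint(0, 10, (4,)),
                   torch.randn(1, 5), torch.randint(0, 5, (1,)),
                   torch.randn(4, 6), torch.randn(4, 6))
    print(loss.item())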

Original language: English (US)
Title of host publication: MM 2018 - Proceedings of the 2018 ACM Multimedia Conference
Publisher: Association for Computing Machinery, Inc
Pages: 501-509
Number of pages: 9
ISBN (Electronic): 9781450356657
DOI: 10.1145/3240508.3240700
State: Published - Oct 15 2018
Event: 26th ACM Multimedia Conference, MM 2018 - Seoul, Korea, Republic of
Duration: Oct 22 2018 - Oct 26 2018


Keywords

  • CNN
  • Indoor scene understanding
  • LSTM
  • Object detection
  • Scene classification

ASJC Scopus subject areas

  • Computer Graphics and Computer-Aided Design
  • Human-Computer Interaction

Cite this

Zhu, F., Shen, F., Liu, L., Shao, L., Xie, J., & Fang, Y. (2018). Learning to synthesize 3D indoor scenes from monocular images. In MM 2018 - Proceedings of the 2018 ACM Multimedia Conference (pp. 501-509). Association for Computing Machinery, Inc. https://doi.org/10.1145/3240508.3240700
