Variational convolutional networks for human-centric annotations

Tsung Wei Ke, Che Wei Lin, Tyng Luh Liu, Davi Geiger

Research output: Chapter in Book/Report/Conference proceeding › Chapter

Abstract

Modeling how a human would annotate an image is an important and interesting task relevant to image captioning. Its main challenge is that the same visual concept may be important in some images but less salient in others. Further, the subjective viewpoint of a human annotator also plays a crucial role in finalizing the annotations. To deal with such high variability, we introduce a new deep net model that integrates a CNN with a variational auto-encoder (VAE). With the latent features embedded in a VAE, the model becomes more flexible in tackling the uncertainty of human-centric annotations. On the other hand, the supervised generalization further enables the discriminative power of the generative VAE model. The resulting model can be fine-tuned end-to-end to further improve its performance in predicting visual concepts. Experimental results show that our method achieves state-of-the-art performance on two benchmark datasets, MS COCO and Flickr30K, producing mAP of 36.6 and 23.49, and PHR (Precision at Human Recall) of 49.9 and 32.04, respectively.
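The abstract describes coupling a CNN's features with a VAE latent space so that annotation uncertainty can be sampled rather than predicted deterministically. The following is a minimal numpy sketch of just the VAE latent step (the reparameterization trick and the closed-form KL term); it is illustrative only and not the authors' implementation — the projection weights, dimensions, and the stand-in "CNN features" are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def reparameterize(mu, log_var, rng):
    # z = mu + sigma * eps, with eps ~ N(0, I): sampling stays differentiable
    # with respect to mu and log_var, which is what makes end-to-end training possible.
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_divergence(mu, log_var):
    # Closed-form KL(q(z|x) || N(0, I)), summed over latent dims, averaged over the batch.
    return float(np.mean(-0.5 * np.sum(1.0 + log_var - mu**2 - np.exp(log_var), axis=1)))

# Stand-in for CNN features of a batch of 4 images, projected to an 8-dim latent.
feats = rng.standard_normal((4, 16))
W_mu = rng.standard_normal((16, 8))
W_lv = rng.standard_normal((16, 8)) * 0.01  # small weights keep log-variance near 0

mu, log_var = feats @ W_mu, feats @ W_lv
z = reparameterize(mu, log_var, rng)
print(z.shape)  # (4, 8)
```

In a full model along these lines, `z` would feed a classifier over visual concepts, and the KL term would be added to the supervised loss as the VAE regularizer.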

Original language: English (US)
Title of host publication: Computer Vision - 13th Asian Conference on Computer Vision, ACCV 2016, Revised Selected Papers
Publisher: Springer Verlag
Pages: 120-135
Number of pages: 16
Volume: 10114 LNCS
ISBN (Print): 9783319541891
DOI: 10.1007/978-3-319-54190-7_8
State: Published - 2017

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 10114 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

ASJC Scopus subject areas

  • Theoretical Computer Science
  • Computer Science (all)

Cite this

Ke, T. W., Lin, C. W., Liu, T. L., & Geiger, D. (2017). Variational convolutional networks for human-centric annotations. In Computer Vision - 13th Asian Conference on Computer Vision, ACCV 2016, Revised Selected Papers (Vol. 10114 LNCS, pp. 120-135). (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 10114 LNCS). Springer Verlag. https://doi.org/10.1007/978-3-319-54190-7_8
