Look, Listen, and Learn More

Design Choices for Deep Audio Embeddings

Jason Cramer, Ho Hsiang Wu, Justin Salamon, Juan Bello

Research output: Chapter in Book/Report/Conference proceedingConference contribution

Abstract

A considerable challenge in applying deep learning to audio classification is the scarcity of labeled data. An increasingly popular solution is to learn deep audio embeddings from large audio collections and use them to train shallow classifiers using small labeled datasets. Look, Listen, and Learn (L3-Net) is an embedding trained through self-supervised learning of audio-visual correspondence in videos as opposed to other embeddings requiring labeled data. This framework has the potential to produce powerful out-of-the-box embeddings for downstream audio classification tasks, but has a number of unexplained design choices that may impact the embeddings' behavior. In this paper we investigate how L3-Net design choices impact the performance of downstream audio classifiers trained with these embeddings. We show that audio-informed choices of input representation are important, and that using sufficient data for training the embedding is key. Surprisingly, we find that matching the content for training the embedding to the downstream task is not beneficial. Finally, we show that our best variant of the L3-Net embedding outperforms both the VGGish and SoundNet embeddings, while having fewer parameters and being trained on less data. Our implementation of the L3-Net embedding model as well as pre-trained models are made freely available online.

Original languageEnglish (US)
Title of host publication2019 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2019 - Proceedings
PublisherInstitute of Electrical and Electronics Engineers Inc.
Pages3852-3856
Number of pages5
ISBN (Electronic)9781479981311
DOIs
StatePublished - May 1 2019
Event44th IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2019 - Brighton, United Kingdom
Duration: May 12 2019May 17 2019

Publication series

NameICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
Volume2019-May
ISSN (Print)1520-6149

Conference

Conference44th IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2019
CountryUnited Kingdom
CityBrighton
Period5/12/195/17/19

Fingerprint

Classifiers
Supervised learning
Deep learning

Keywords

  • Audio classification
  • deep audio embeddings
  • deep learning
  • machine listening
  • transfer learning

ASJC Scopus subject areas

  • Software
  • Signal Processing
  • Electrical and Electronic Engineering

Cite this

Cramer, J., Wu, H. H., Salamon, J., & Bello, J. (2019). Look, Listen, and Learn More: Design Choices for Deep Audio Embeddings. In 2019 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2019 - Proceedings (pp. 3852-3856). [8682475] (ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings; Vol. 2019-May). Institute of Electrical and Electronics Engineers Inc.. https://doi.org/10.1109/ICASSP.2019.8682475

Look, Listen, and Learn More : Design Choices for Deep Audio Embeddings. / Cramer, Jason; Wu, Ho Hsiang; Salamon, Justin; Bello, Juan.

2019 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2019 - Proceedings. Institute of Electrical and Electronics Engineers Inc., 2019. p. 3852-3856 8682475 (ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings; Vol. 2019-May).

Research output: Chapter in Book/Report/Conference proceedingConference contribution

Cramer, J, Wu, HH, Salamon, J & Bello, J 2019, Look, Listen, and Learn More: Design Choices for Deep Audio Embeddings. in 2019 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2019 - Proceedings., 8682475, ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings, vol. 2019-May, Institute of Electrical and Electronics Engineers Inc., pp. 3852-3856, 44th IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2019, Brighton, United Kingdom, 5/12/19. https://doi.org/10.1109/ICASSP.2019.8682475
Cramer J, Wu HH, Salamon J, Bello J. Look, Listen, and Learn More: Design Choices for Deep Audio Embeddings. In 2019 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2019 - Proceedings. Institute of Electrical and Electronics Engineers Inc. 2019. p. 3852-3856. 8682475. (ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings). https://doi.org/10.1109/ICASSP.2019.8682475
Cramer, Jason ; Wu, Ho Hsiang ; Salamon, Justin ; Bello, Juan. / Look, Listen, and Learn More : Design Choices for Deep Audio Embeddings. 2019 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2019 - Proceedings. Institute of Electrical and Electronics Engineers Inc., 2019. pp. 3852-3856 (ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings).
@inproceedings{15efe79d019840ddb70998d6d8c1e437,
title = "Look, Listen, and Learn More: Design Choices for Deep Audio Embeddings",
abstract = "A considerable challenge in applying deep learning to audio classification is the scarcity of labeled data. An increasingly popular solution is to learn deep audio embeddings from large audio collections and use them to train shallow classifiers using small labeled datasets. Look, Listen, and Learn (L3-Net) is an embedding trained through self-supervised learning of audio-visual correspondence in videos as opposed to other embeddings requiring labeled data. This framework has the potential to produce powerful out-of-the-box embeddings for downstream audio classification tasks, but has a number of unexplained design choices that may impact the embeddings' behavior. In this paper we investigate how L3-Net design choices impact the performance of downstream audio classifiers trained with these embeddings. We show that audio-informed choices of input representation are important, and that using sufficient data for training the embedding is key. Surprisingly, we find that matching the content for training the embedding to the downstream task is not beneficial. Finally, we show that our best variant of the L3-Net embedding outperforms both the VGGish and SoundNet embeddings, while having fewer parameters and being trained on less data. Our implementation of the L3-Net embedding model as well as pre-trained models are made freely available online.",
keywords = "Audio classification, deep audio embeddings, deep learning, machine listening, transfer learning",
author = "Jason Cramer and Wu, {Ho Hsiang} and Justin Salamon and Juan Bello",
year = "2019",
month = "5",
day = "1",
doi = "10.1109/ICASSP.2019.8682475",
language = "English (US)",
series = "ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings",
publisher = "Institute of Electrical and Electronics Engineers Inc.",
pages = "3852--3856",
booktitle = "2019 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2019 - Proceedings",

}

TY - GEN

T1 - Look, Listen, and Learn More

T2 - Design Choices for Deep Audio Embeddings

AU - Cramer, Jason

AU - Wu, Ho Hsiang

AU - Salamon, Justin

AU - Bello, Juan

PY - 2019/5/1

Y1 - 2019/5/1

N2 - A considerable challenge in applying deep learning to audio classification is the scarcity of labeled data. An increasingly popular solution is to learn deep audio embeddings from large audio collections and use them to train shallow classifiers using small labeled datasets. Look, Listen, and Learn (L3-Net) is an embedding trained through self-supervised learning of audio-visual correspondence in videos as opposed to other embeddings requiring labeled data. This framework has the potential to produce powerful out-of-the-box embeddings for downstream audio classification tasks, but has a number of unexplained design choices that may impact the embeddings' behavior. In this paper we investigate how L3-Net design choices impact the performance of downstream audio classifiers trained with these embeddings. We show that audio-informed choices of input representation are important, and that using sufficient data for training the embedding is key. Surprisingly, we find that matching the content for training the embedding to the downstream task is not beneficial. Finally, we show that our best variant of the L3-Net embedding outperforms both the VGGish and SoundNet embeddings, while having fewer parameters and being trained on less data. Our implementation of the L3-Net embedding model as well as pre-trained models are made freely available online.

AB - A considerable challenge in applying deep learning to audio classification is the scarcity of labeled data. An increasingly popular solution is to learn deep audio embeddings from large audio collections and use them to train shallow classifiers using small labeled datasets. Look, Listen, and Learn (L3-Net) is an embedding trained through self-supervised learning of audio-visual correspondence in videos as opposed to other embeddings requiring labeled data. This framework has the potential to produce powerful out-of-the-box embeddings for downstream audio classification tasks, but has a number of unexplained design choices that may impact the embeddings' behavior. In this paper we investigate how L3-Net design choices impact the performance of downstream audio classifiers trained with these embeddings. We show that audio-informed choices of input representation are important, and that using sufficient data for training the embedding is key. Surprisingly, we find that matching the content for training the embedding to the downstream task is not beneficial. Finally, we show that our best variant of the L3-Net embedding outperforms both the VGGish and SoundNet embeddings, while having fewer parameters and being trained on less data. Our implementation of the L3-Net embedding model as well as pre-trained models are made freely available online.

KW - Audio classification

KW - deep audio embeddings

KW - deep learning

KW - machine listening

KW - transfer learning

UR - http://www.scopus.com/inward/record.url?scp=85068992001&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85068992001&partnerID=8YFLogxK

U2 - 10.1109/ICASSP.2019.8682475

DO - 10.1109/ICASSP.2019.8682475

M3 - Conference contribution

T3 - ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings

SP - 3852

EP - 3856

BT - 2019 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2019 - Proceedings

PB - Institute of Electrical and Electronics Engineers Inc.

ER -