A complete statistical model for calibration of RNA-seq counts using external spike-ins and maximum likelihood theory

Rodoniki Athanasiadou, Benjamin Neymotin, Nathan Brandt, Wei Wang, Lionel Christiaen, David Gresham, Daniel Tranchina

Research output: Contribution to journal › Article

Abstract

A fundamental assumption, common to the vast majority of high-throughput transcriptome analyses, is that the expression of most genes is unchanged among samples and that total cellular RNA remains constant. As the number of analyzed experimental systems increases, however, independent studies demonstrate that this assumption is often violated. We present a calibration method using RNA spike-ins that allows for the measurement of absolute cellular abundance of RNA molecules. We apply the method to pooled RNA from cell populations of known sizes. For each transcript, we compute a nominal abundance that can be converted to an absolute abundance by dividing by a scale factor determined in separate experiments: the yield coefficient of the transcript relative to that of a reference spike-in measured with the same protocol. The method is derived by maximum likelihood theory in the context of a complete statistical model for sequencing counts contributed by cellular RNA and spike-ins. The counts are based on a sample from a fixed number of cells to which a fixed population of spike-in molecules has been added. We illustrate and evaluate the method with applications to two global expression data sets, one from the model eukaryote Saccharomyces cerevisiae proliferating at different growth rates, and the other from differentiating cardiopharyngeal cell lineages in the chordate Ciona robusta. We tested the method in a technical replicate dilution study and in a k-fold validation study.
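
The conversion described in the abstract (a nominal abundance divided by a relative yield coefficient) can be illustrated with a minimal Python sketch. This is not the authors' statistical model. It assumes, purely for illustration, that each spike-in count is Poisson distributed with a mean proportional to the number of spike-in molecules added, in which case the maximum likelihood estimate of the proportionality constant is the total spike-in count divided by the total number of spike-in molecules. All function names, parameter names, and numbers below are hypothetical.

```python
from typing import Sequence


def poisson_scale_mle(spike_counts: Sequence[float],
                      spike_molecules: Sequence[float]) -> float:
    """MLE of nu under the ASSUMED model count_i ~ Poisson(nu * molecules_i).

    For this simplified model the estimate is sum(counts) / sum(molecules).
    """
    total_molecules = sum(spike_molecules)
    if total_molecules <= 0:
        raise ValueError("spike-in molecule total must be positive")
    return sum(spike_counts) / total_molecules


def nominal_abundance_per_cell(count: float, scale: float, n_cells: int) -> float:
    """Nominal molecules per cell: the count divided by counts-per-molecule and cell number."""
    return count / (scale * n_cells)


def absolute_abundance(nominal: float, relative_yield: float) -> float:
    """Divide the nominal abundance by the transcript's yield coefficient
    relative to the reference spike-in (measured in a separate experiment)."""
    return nominal / relative_yield


# Hypothetical inputs: three spike-ins, one transcript, 1000 pooled cells.
scale = poisson_scale_mle([100, 200, 400], [1000, 2000, 4000])
nominal = nominal_abundance_per_cell(count=500, scale=scale, n_cells=1000)
absolute = absolute_abundance(nominal, relative_yield=0.5)
print(scale, nominal, absolute)
```

With these made-up inputs the scale is 0.1 counts per molecule, the nominal abundance is 5 molecules per cell, and a relative yield of 0.5 doubles that to an absolute abundance of 10 molecules per cell.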

Original language: English (US)
Pages (from-to): e1006794
Journal: PLoS Computational Biology
Volume: 15
Issue number: 3
DOI: 10.1371/journal.pcbi.1006794
State: Published - Mar 1 2019

Fingerprint

Statistical Models
RNA
Spike
Calibration
Maximum Likelihood
Count
Molecules
Chordata
Methodology
Validation Studies
Scale Factor
Cell Population
Cell
Saccharomyces cerevisiae
Gene Expression Profiling
Cell Lineage

ASJC Scopus subject areas

  • Ecology, Evolution, Behavior and Systematics
  • Modeling and Simulation
  • Ecology
  • Molecular Biology
  • Genetics
  • Cellular and Molecular Neuroscience
  • Computational Theory and Mathematics

Cite this

A complete statistical model for calibration of RNA-seq counts using external spike-ins and maximum likelihood theory. / Athanasiadou, Rodoniki; Neymotin, Benjamin; Brandt, Nathan; Wang, Wei; Christiaen, Lionel; Gresham, David; Tranchina, Daniel.

In: PLoS Computational Biology, Vol. 15, No. 3, 01.03.2019, p. e1006794.

@article{484d80f029dd479fb3dd2cd40757a46b,
title = "A complete statistical model for calibration of RNA-seq counts using external spike-ins and maximum likelihood theory",
abstract = "A fundamental assumption, common to the vast majority of high-throughput transcriptome analyses, is that the expression of most genes is unchanged among samples and that total cellular RNA remains constant. As the number of analyzed experimental systems increases however, different independent studies demonstrate that this assumption is often violated. We present a calibration method using RNA spike-ins that allows for the measurement of absolute cellular abundance of RNA molecules. We apply the method to pooled RNA from cell populations of known sizes. For each transcript, we compute a nominal abundance that can be converted to absolute by dividing by a scale factor determined in separate experiments: the yield coefficient of the transcript relative to that of a reference spike-in measured with the same protocol. The method is derived by maximum likelihood theory in the context of a complete statistical model for sequencing counts contributed by cellular RNA and spike-ins. The counts are based on a sample from a fixed number of cells to which a fixed population of spike-in molecules has been added. We illustrate and evaluate the method with applications to two global expression data sets, one from the model eukaryote Saccharomyces cerevisiae, proliferating at different growth rates, and differentiating cardiopharyngeal cell lineages in the chordate Ciona robusta. We tested the method in a technical replicate dilution study, and in a k-fold validation study.",
author = "Rodoniki Athanasiadou and Benjamin Neymotin and Nathan Brandt and Wei Wang and Lionel Christiaen and David Gresham and Daniel Tranchina",
year = "2019",
month = "3",
day = "1",
doi = "10.1371/journal.pcbi.1006794",
language = "English (US)",
volume = "15",
pages = "e1006794",
journal = "PLoS Computational Biology",
issn = "1553-734X",
publisher = "Public Library of Science",
number = "3",
}

TY - JOUR

T1 - A complete statistical model for calibration of RNA-seq counts using external spike-ins and maximum likelihood theory

AU - Athanasiadou, Rodoniki

AU - Neymotin, Benjamin

AU - Brandt, Nathan

AU - Wang, Wei

AU - Christiaen, Lionel

AU - Gresham, David

AU - Tranchina, Daniel

PY - 2019/3/1

Y1 - 2019/3/1

AB - A fundamental assumption, common to the vast majority of high-throughput transcriptome analyses, is that the expression of most genes is unchanged among samples and that total cellular RNA remains constant. As the number of analyzed experimental systems increases however, different independent studies demonstrate that this assumption is often violated. We present a calibration method using RNA spike-ins that allows for the measurement of absolute cellular abundance of RNA molecules. We apply the method to pooled RNA from cell populations of known sizes. For each transcript, we compute a nominal abundance that can be converted to absolute by dividing by a scale factor determined in separate experiments: the yield coefficient of the transcript relative to that of a reference spike-in measured with the same protocol. The method is derived by maximum likelihood theory in the context of a complete statistical model for sequencing counts contributed by cellular RNA and spike-ins. The counts are based on a sample from a fixed number of cells to which a fixed population of spike-in molecules has been added. We illustrate and evaluate the method with applications to two global expression data sets, one from the model eukaryote Saccharomyces cerevisiae, proliferating at different growth rates, and differentiating cardiopharyngeal cell lineages in the chordate Ciona robusta. We tested the method in a technical replicate dilution study, and in a k-fold validation study.

UR - http://www.scopus.com/inward/record.url?scp=85063623530&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85063623530&partnerID=8YFLogxK

U2 - 10.1371/journal.pcbi.1006794

DO - 10.1371/journal.pcbi.1006794

M3 - Article

C2 - 30856174

AN - SCOPUS:85063623530

VL - 15

SP - e1006794

JO - PLoS Computational Biology

JF - PLoS Computational Biology

SN - 1553-734X

IS - 3

ER -