DNA copy number analysis of fresh and formalin-fixed specimens by shallow whole-genome sequencing with identification and exclusion of problematic regions in the genome assembly

Ilari Scheinin, Daoud Sie, Henrik Bengtsson, Mark A. Van De Wiel, Adam B. Olshen, Hinke F. Van Thuijl, Hendrik F. Van Essen, Paul P. Eijk, François Rustenburg, Gerrit A. Meijer, Jaap C. Reijneveld, Pieter Wesseling, Daniel Pinkel, Donna Albertson, Bauke Ylstra

Research output: Contribution to journal › Article

Abstract

Detection of DNA copy number aberrations by shallow whole-genome sequencing (WGS) faces many challenges, including lack of completion and errors in the human reference genome, repetitive sequences, polymorphisms, variable sample quality, and biases in the sequencing procedures. Formalin-fixed paraffin-embedded (FFPE) archival material, the analysis of which is important for studies of cancer, presents particular analytical difficulties due to degradation of the DNA and frequent lack of matched reference samples. We present a robust, cost-effective WGS method for DNA copy number analysis that addresses these challenges more successfully than currently available procedures. In practice, very useful profiles can be obtained with ∼0.1x genome coverage. We improve on previous methods by first implementing a combined correction for sequence mappability and GC content, and second, by applying this procedure to sequence data from the 1000 Genomes Project in order to develop a blacklist of problematic genome regions. A small subset of these blacklisted regions was previously identified by ENCODE, but the vast majority are novel unappreciated problematic regions. Our procedures are implemented in a pipeline called QDNAseq. We have analyzed over 1000 samples, most of which were obtained from the fixed tissue archives of more than 25 institutions. We demonstrate that for most samples our sequencing and analysis procedures yield genome profiles with noise levels near the statistical limit imposed by read counting. The described procedures also provide better correction of artifacts introduced by low DNA quality than prior approaches and better copy number data than high-resolution microarrays at a substantially lower cost.
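The "statistical limit imposed by read counting" mentioned in the abstract can be illustrated with a short calculation. The following is a minimal sketch, not the QDNAseq implementation: it assumes read counts per bin are Poisson distributed, so the relative noise is 1/sqrt(N) for N expected reads per bin. Only the ∼0.1x coverage figure comes from the abstract; the 50 bp read length, the bin sizes, and the function names are illustrative assumptions.

```python
import math


def expected_reads_per_bin(coverage, read_length_bp, bin_size_bp):
    """Expected number of reads falling in one bin at a given mean coverage."""
    return coverage * bin_size_bp / read_length_bp


def poisson_noise_floor(reads_per_bin):
    """Relative standard deviation (sd / mean) of a Poisson-distributed count."""
    return 1.0 / math.sqrt(reads_per_bin)


if __name__ == "__main__":
    coverage = 0.1        # from the abstract (~0.1x genome coverage)
    read_length_bp = 50   # assumption, not stated in the abstract
    # Bin sizes are assumptions chosen only to show the trend.
    for bin_size_bp in (15_000, 100_000, 1_000_000):
        n = expected_reads_per_bin(coverage, read_length_bp, bin_size_bp)
        noise = poisson_noise_floor(n)
        print(f"{bin_size_bp:>9} bp bins: ~{n:.0f} reads/bin, "
              f"relative noise floor ~{noise:.3f}")
```

Under these assumptions, larger bins collect more reads and so lower the counting-noise floor, at the price of coarser genomic resolution. A measured profile whose noise is close to this floor has little remaining technical variance to remove.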

Original language: English (US)
Pages (from-to): 2022-2032
Number of pages: 11
Journal: Genome Research
Volume: 24
Issue number: 12
DOIs: https://doi.org/10.1101/gr.175141.114
State: Published - Dec 1 2014

Fingerprint

Formaldehyde
Genome
DNA
Costs and Cost Analysis
Nucleic Acid Repetitive Sequences
Base Composition
Human Genome
DNA Sequence Analysis
Paraffin
Artifacts
Noise
Neoplasms

ASJC Scopus subject areas

  • Genetics
  • Genetics (clinical)

Cite this

DNA copy number analysis of fresh and formalin-fixed specimens by shallow whole-genome sequencing with identification and exclusion of problematic regions in the genome assembly. / Scheinin, Ilari; Sie, Daoud; Bengtsson, Henrik; Van De Wiel, Mark A.; Olshen, Adam B.; Van Thuijl, Hinke F.; Van Essen, Hendrik F.; Eijk, Paul P.; Rustenburg, François; Meijer, Gerrit A.; Reijneveld, Jaap C.; Wesseling, Pieter; Pinkel, Daniel; Albertson, Donna; Ylstra, Bauke.

In: Genome Research, Vol. 24, No. 12, 01.12.2014, p. 2022-2032.

Scheinin, I, Sie, D, Bengtsson, H, Van De Wiel, MA, Olshen, AB, Van Thuijl, HF, Van Essen, HF, Eijk, PP, Rustenburg, F, Meijer, GA, Reijneveld, JC, Wesseling, P, Pinkel, D, Albertson, D & Ylstra, B 2014, 'DNA copy number analysis of fresh and formalin-fixed specimens by shallow whole-genome sequencing with identification and exclusion of problematic regions in the genome assembly', Genome Research, vol. 24, no. 12, pp. 2022-2032. https://doi.org/10.1101/gr.175141.114
Scheinin, Ilari ; Sie, Daoud ; Bengtsson, Henrik ; Van De Wiel, Mark A. ; Olshen, Adam B. ; Van Thuijl, Hinke F. ; Van Essen, Hendrik F. ; Eijk, Paul P. ; Rustenburg, François ; Meijer, Gerrit A. ; Reijneveld, Jaap C. ; Wesseling, Pieter ; Pinkel, Daniel ; Albertson, Donna ; Ylstra, Bauke. / DNA copy number analysis of fresh and formalin-fixed specimens by shallow whole-genome sequencing with identification and exclusion of problematic regions in the genome assembly. In: Genome Research. 2014 ; Vol. 24, No. 12. pp. 2022-2032.
@article{8aa4bf8608c040edbe45322021145716,
title = "DNA copy number analysis of fresh and formalin-fixed specimens by shallow whole-genome sequencing with identification and exclusion of problematic regions in the genome assembly",
abstract = "Detection of DNA copy number aberrations by shallow whole-genome sequencing (WGS) faces many challenges, including lack of completion and errors in the human reference genome, repetitive sequences, polymorphisms, variable sample quality, and biases in the sequencing procedures. Formalin-fixed paraffin-embedded (FFPE) archival material, the analysis of which is important for studies of cancer, presents particular analytical difficulties due to degradation of the DNA and frequent lack of matched reference samples. We present a robust, cost-effective WGS method for DNA copy number analysis that addresses these challenges more successfully than currently available procedures. In practice, very useful profiles can be obtained with ∼0.1x genome coverage. We improve on previous methods by first implementing a combined correction for sequence mappability and GC content, and second, by applying this procedure to sequence data from the 1000 Genomes Project in order to develop a blacklist of problematic genome regions. A small subset of these blacklisted regions was previously identified by ENCODE, but the vast majority are novel unappreciated problematic regions. Our procedures are implemented in a pipeline called QDNAseq. We have analyzed over 1000 samples, most of which were obtained from the fixed tissue archives of more than 25 institutions. We demonstrate that for most samples our sequencing and analysis procedures yield genome profiles with noise levels near the statistical limit imposed by read counting. The described procedures also provide better correction of artifacts introduced by low DNA quality than prior approaches and better copy number data than high-resolution microarrays at a substantially lower cost.",
author = "Ilari Scheinin and Daoud Sie and Henrik Bengtsson and {Van De Wiel}, {Mark A.} and Olshen, {Adam B.} and {Van Thuijl}, {Hinke F.} and {Van Essen}, {Hendrik F.} and Eijk, {Paul P.} and François Rustenburg and Meijer, {Gerrit A.} and Reijneveld, {Jaap C.} and Pieter Wesseling and Daniel Pinkel and Donna Albertson and Bauke Ylstra",
year = "2014",
month = "12",
day = "1",
doi = "10.1101/gr.175141.114",
language = "English (US)",
volume = "24",
pages = "2022--2032",
journal = "Genome Research",
issn = "1088-9051",
publisher = "Cold Spring Harbor Laboratory Press",
number = "12",

}

TY - JOUR

T1 - DNA copy number analysis of fresh and formalin-fixed specimens by shallow whole-genome sequencing with identification and exclusion of problematic regions in the genome assembly

AU - Scheinin, Ilari

AU - Sie, Daoud

AU - Bengtsson, Henrik

AU - Van De Wiel, Mark A.

AU - Olshen, Adam B.

AU - Van Thuijl, Hinke F.

AU - Van Essen, Hendrik F.

AU - Eijk, Paul P.

AU - Rustenburg, François

AU - Meijer, Gerrit A.

AU - Reijneveld, Jaap C.

AU - Wesseling, Pieter

AU - Pinkel, Daniel

AU - Albertson, Donna

AU - Ylstra, Bauke

PY - 2014/12/1

Y1 - 2014/12/1

AB - Detection of DNA copy number aberrations by shallow whole-genome sequencing (WGS) faces many challenges, including lack of completion and errors in the human reference genome, repetitive sequences, polymorphisms, variable sample quality, and biases in the sequencing procedures. Formalin-fixed paraffin-embedded (FFPE) archival material, the analysis of which is important for studies of cancer, presents particular analytical difficulties due to degradation of the DNA and frequent lack of matched reference samples. We present a robust, cost-effective WGS method for DNA copy number analysis that addresses these challenges more successfully than currently available procedures. In practice, very useful profiles can be obtained with ∼0.1x genome coverage. We improve on previous methods by first implementing a combined correction for sequence mappability and GC content, and second, by applying this procedure to sequence data from the 1000 Genomes Project in order to develop a blacklist of problematic genome regions. A small subset of these blacklisted regions was previously identified by ENCODE, but the vast majority are novel unappreciated problematic regions. Our procedures are implemented in a pipeline called QDNAseq. We have analyzed over 1000 samples, most of which were obtained from the fixed tissue archives of more than 25 institutions. We demonstrate that for most samples our sequencing and analysis procedures yield genome profiles with noise levels near the statistical limit imposed by read counting. The described procedures also provide better correction of artifacts introduced by low DNA quality than prior approaches and better copy number data than high-resolution microarrays at a substantially lower cost.

UR - http://www.scopus.com/inward/record.url?scp=84913590291&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84913590291&partnerID=8YFLogxK

U2 - 10.1101/gr.175141.114

DO - 10.1101/gr.175141.114

M3 - Article

VL - 24

SP - 2022

EP - 2032

JO - Genome Research

JF - Genome Research

SN - 1088-9051

IS - 12

ER -