A nearly linear-time general algorithm for Genome-Wide Bi-allele haplotype phasing

Will Casey, Bhubaneswar Mishra

Research output: Contribution to journal › Article

Abstract

The determination of feature maps, such as STS (sequence-tagged site), SNP (single nucleotide polymorphism) or RFLP (restriction fragment length polymorphism) maps, for each chromosome copy or haplotype in an individual has important potential applications in genetics, clinical biology and association studies. We consider the problem of reconstructing the two haplotypes of a diploid individual from genotype data generated by mapping experiments, and present an algorithm to recover them. The problem of optimizing existing methods of SNP phasing with a population of diploid genotypes has been investigated in [7] and found to be NP-hard. In contrast, using single-molecule methods, we show that although the haplotypes are not known and the data are further confounded by the mapping error model, reasonable assumptions on the mapping process allow us to recover the co-associations of allele types across consecutive loci and to estimate the haplotypes with an efficient algorithm. The haplotype reconstruction algorithm has two stages. Stage I detects polymorphic marker types by modifying an EM algorithm for Gaussian mixture models; an example is given for RFLP sizing. Stage II addresses phasing and presents a local maximum likelihood method for inferring the haplotypes of an individual. The algorithm is nearly linear in the number of polymorphic loci. Results on simulated RFLP sizing data are encouraging and suggest that the method will prove practical for haplotype phasing.
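The two stages named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say how the EM algorithm is modified or how the mapping error model is defined, so Stage I below is the textbook, unmodified EM for a two-component one-dimensional Gaussian mixture, and Stage II is a simple majority vote between adjacent loci. Under the assumed error model (each allele call is wrong independently with the same probability below 1/2), the vote is the maximum likelihood choice for each adjacent pair. The function names, the quartile initialisation and the dictionary encoding of molecules are all hypothetical.

```python
import math


def em_two_gaussians(sizes, iters=100):
    """Stage I sketch: fit a two-component 1-D Gaussian mixture by plain EM.

    `sizes` are fragment sizes at one marker. Returns (weights, means,
    variances). Two well-separated means suggest a polymorphic marker.
    """
    xs = sorted(sizes)
    n = len(xs)
    mean_all = sum(xs) / n
    var_all = max(sum((x - mean_all) ** 2 for x in xs) / n, 1e-9)
    # Assumed initialisation: means at the quartiles, shared pooled variance.
    mu = [xs[n // 4], xs[(3 * n) // 4]]
    var = [var_all, var_all]
    w = [0.5, 0.5]
    for _ in range(iters):
        # E-step: posterior probability that each size came from component k.
        resp = []
        for x in xs:
            p = [
                w[k] * math.exp(-((x - mu[k]) ** 2) / (2 * var[k]))
                / math.sqrt(2 * math.pi * var[k])
                for k in range(2)
            ]
            total = p[0] + p[1]
            resp.append((p[0] / total, p[1] / total))
        # M-step: re-estimate weight, mean and variance of each component.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            w[k] = nk / n
            mu[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            var[k] = max(
                sum(r[k] * (x - mu[k]) ** 2 for r, x in zip(resp, xs)) / nk,
                1e-9,
            )
    return w, mu, var


def phase_adjacent(molecules, n_loci):
    """Stage II sketch: phase consecutive loci by majority vote.

    Each molecule is a dict {locus_index: allele} with allele in {0, 1}.
    Returns one haplotype as a list; the other is its complement.
    Ties are resolved in favour of "same allele" (an arbitrary choice).
    """
    hap = [0]  # the label of the first allele is arbitrary
    for j in range(n_loci - 1):
        pairs = [(m[j], m[j + 1]) for m in molecules if j in m and j + 1 in m]
        same = sum(1 for a, b in pairs if a == b)
        diff = len(pairs) - same
        hap.append(hap[-1] if same >= diff else 1 - hap[-1])
    return hap
```

Because `phase_adjacent` rescans every molecule at each locus, this sketch runs in time proportional to the number of loci times the number of molecules; it does not attempt to reproduce the near-linear running time reported in the abstract.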

Original language: English (US)
Pages (from-to): 204-215
Number of pages: 12
Journal: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 2913
State: Published - 2003

Fingerprint

Haplotype
Haplotypes
Linear Time
Polymorphism
Genome
Genes
Alleles
Restriction Fragment Length Polymorphisms
Nucleotides
Fragment
Single nucleotide Polymorphism
Restriction
Genotype
Diploidy
Locus
Local Likelihood
Chromosomes
Maximum likelihood
Error Model

ASJC Scopus subject areas

  • Computer Science(all)
  • Biochemistry, Genetics and Molecular Biology(all)
  • Theoretical Computer Science

Cite this

@article{f3b816c571784c3a8e155206cff3d454,
title = "A nearly linear-time general algorithm for Genome-Wide Bi-allele haplotype phasing",
author = "Will Casey and Bhubaneswar Mishra",
year = "2003",
language = "English (US)",
volume = "2913",
pages = "204--215",
journal = "Lecture Notes in Computer Science",
issn = "0302-9743",
publisher = "Springer Verlag",

}

TY - JOUR
T1 - A nearly linear-time general algorithm for Genome-Wide Bi-allele haplotype phasing
AU - Casey, Will
AU - Mishra, Bhubaneswar
PY - 2003
Y1 - 2003
UR - http://www.scopus.com/inward/record.url?scp=33846788496&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=33846788496&partnerID=8YFLogxK
M3 - Article
VL - 2913
SP - 204
EP - 215
JO - Lecture Notes in Computer Science
JF - Lecture Notes in Computer Science
SN - 0302-9743
ER -