Star-galaxy classification in multi-band optical imaging

Ross Fadely, David W. Hogg, Beth Willman

    Research output: Contribution to journal › Article

    Abstract

    Ground-based optical surveys such as PanSTARRS, DES, and LSST will produce large catalogs to limiting magnitudes of r ≳ 24. Star-galaxy separation poses a major challenge to such surveys because galaxies - even very compact galaxies - outnumber halo stars at these depths. We investigate photometric classification techniques on stars and galaxies with intrinsic FWHM <0.2 arcsec. We consider unsupervised spectral energy distribution template fitting and supervised, data-driven support vector machines (SVMs). For template fitting, we use a maximum likelihood (ML) method and a new hierarchical Bayesian (HB) method, which learns the prior distribution of template probabilities from the data. SVM requires training data to classify unknown sources; ML and HB do not. We consider (1) a best-case scenario (SVMbest) where the training data are (unrealistically) a random sampling of the data in both signal-to-noise and demographics and (2) a more realistic scenario where training is done on higher signal-to-noise data (SVMreal) at brighter apparent magnitudes. Testing with COSMOS ugriz data, we find that HB outperforms ML, delivering ∼80% completeness, with purity of ∼60%-90% for both stars and galaxies. We find that no algorithm delivers perfect performance and that studies of metal-poor main-sequence turnoff stars may be challenged by poor star-galaxy separation. Using the Receiver Operating Characteristic curve, we find a best-to-worst ranking of SVMbest, HB, ML, and SVMreal. We conclude, therefore, that a well-trained SVM will outperform template-fitting methods. However, a normally trained SVM performs worse. Thus, HB template fitting may prove to be the optimal classification method in future surveys.
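The completeness and purity figures quoted in the abstract can be illustrated with a minimal Python sketch. It assumes the usual definitions: completeness is the fraction of true members of a class that are recovered, and purity is the fraction of sources assigned to a class that truly belong to it. The function name `completeness_and_purity` and the toy labels are hypothetical, invented for illustration; they are not the authors' code or the COSMOS data.

```python
def completeness_and_purity(true_labels, predicted_labels, target):
    """Return (completeness, purity) for the class `target`.

    Assumed definitions (not taken from the paper's code):
      completeness = correctly classified targets / all true targets
      purity       = correctly classified targets / all sources labeled target
    """
    pairs = list(zip(true_labels, predicted_labels))
    true_pos = sum(1 for t, p in pairs if t == target and p == target)
    n_true = sum(1 for t, _ in pairs if t == target)
    n_pred = sum(1 for _, p in pairs if p == target)
    # Guard against empty classes so the sketch never divides by zero.
    completeness = true_pos / n_true if n_true else float("nan")
    purity = true_pos / n_pred if n_pred else float("nan")
    return completeness, purity


# Toy labels, invented for illustration only (not survey data).
truth = ["star"] * 4 + ["galaxy"] * 6
predicted = ["star", "star", "star", "galaxy",
             "star", "galaxy", "galaxy", "galaxy", "galaxy", "galaxy"]

star_c, star_p = completeness_and_purity(truth, predicted, "star")
gal_c, gal_p = completeness_and_purity(truth, predicted, "galaxy")
print(f"stars:    completeness={star_c:.2f}, purity={star_p:.2f}")
print(f"galaxies: completeness={gal_c:.2f}, purity={gal_p:.2f}")
```

Because galaxies outnumber stars at faint magnitudes, even a small fraction of misclassified galaxies can noticeably reduce stellar purity, which is why the two quantities are reported per class.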

    Original language: English (US)
    Article number: 15
    Journal: Astrophysical Journal
    Volume: 760
    Issue number: 1
    DOI: 10.1088/0004-637X/760/1/15
    State: Published - Nov 20 2012

    Fingerprint

    templates
    galaxies
    stars
    education
    random sampling
    compact galaxies
    ranking
    main sequence stars
    completeness
    spectral energy distribution
    catalogs
    halos
    purity
    receivers
    curves
    metals
    support vector machine
    method
    metal
    sampling

    Keywords

    • catalogs
    • galaxies: general
    • Galaxy: stellar content
    • Galaxy: structure
    • methods: data analysis
    • methods: statistical
    • stars: general
    • surveys

    ASJC Scopus subject areas

    • Space and Planetary Science
    • Astronomy and Astrophysics

    Cite this

    Star-galaxy classification in multi-band optical imaging. / Fadely, Ross; Hogg, David W.; Willman, Beth.

    In: Astrophysical Journal, Vol. 760, No. 1, 15, 20.11.2012.

    Fadely, Ross; Hogg, David W.; Willman, Beth. / Star-galaxy classification in multi-band optical imaging. In: Astrophysical Journal. 2012; Vol. 760, No. 1.
    @article{ad588ddead5c45ba842a87167fb2858d,
    title = "Star-galaxy classification in multi-band optical imaging",
    keywords = "catalogs, galaxies: general, Galaxy: stellar content, Galaxy: structure, methods: data analysis, methods: statistical, stars: general, surveys",
    author = "Ross Fadely and Hogg, {David W.} and Beth Willman",
    year = "2012",
    month = "11",
    day = "20",
    doi = "10.1088/0004-637X/760/1/15",
    language = "English (US)",
    volume = "760",
    journal = "Astrophysical Journal",
    issn = "0004-637X",
    publisher = "IOP Publishing Ltd.",
    number = "1",

    }

    TY - JOUR

    T1 - Star-galaxy classification in multi-band optical imaging

    AU - Fadely, Ross

    AU - Hogg, David W.

    AU - Willman, Beth

    PY - 2012/11/20

    Y1 - 2012/11/20

    KW - catalogs

    KW - galaxies: general

    KW - Galaxy: stellar content

    KW - Galaxy: structure

    KW - methods: data analysis

    KW - methods: statistical

    KW - stars: general

    KW - surveys

    UR - http://www.scopus.com/inward/record.url?scp=84868248892&partnerID=8YFLogxK

    UR - http://www.scopus.com/inward/citedby.url?scp=84868248892&partnerID=8YFLogxK

    U2 - 10.1088/0004-637X/760/1/15

    DO - 10.1088/0004-637X/760/1/15

    M3 - Article

    VL - 760

    JO - Astrophysical Journal

    JF - Astrophysical Journal

    SN - 0004-637X

    IS - 1

    M1 - 15

    ER -