The science of guessing: Analyzing an anonymized corpus of 70 million passwords

Joseph Bonneau

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

We report on the largest corpus of user-chosen passwords ever studied, consisting of anonymized password histograms representing almost 70 million Yahoo! users, mitigating privacy concerns while enabling analysis of dozens of subpopulations based on demographic factors and site usage characteristics. This large data set motivates a thorough statistical treatment of estimating guessing difficulty by sampling from a secret distribution. In place of previously used metrics such as Shannon entropy and guessing entropy, which cannot be estimated with any realistically sized sample, we develop partial guessing metrics including a new variant of guesswork parameterized by an attacker's desired success rate. Our new metric is comparatively easy to approximate and directly relevant for security engineering. By comparing password distributions with a uniform distribution which would provide equivalent security against different forms of guessing attack, we estimate that passwords provide fewer than 10 bits of security against an online, trawling attack, and only about 20 bits of security against an optimal offline dictionary attack. We find surprisingly little variation in guessing difficulty; every identifiable group of users generated a comparably weak password distribution. Security motivations such as the registration of a payment card have no greater impact than demographic factors such as age and nationality. Even proactive efforts to nudge users towards better password choices with graphical feedback make little difference. More surprisingly, even seemingly distant language communities choose the same weak passwords, and an attacker never gains more than a factor of 2 in efficiency by switching from the globally optimal dictionary to a population-specific list.
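
The idea of expressing guessing difficulty in bits, by comparison with a uniform distribution of equivalent strength, can be illustrated with a minimal Python sketch. The abstract does not state any formulas, so the definitions below are assumptions recalled from the partial guessing literature: a success rate for an attacker limited to beta guesses, and a guesswork variant for an attacker who stops after compromising a fraction alpha of accounts. The function names and the toy histogram are hypothetical and are not taken from the Yahoo! data.

```python
import math


def sorted_probs(histogram):
    """Convert a {password: count} histogram to probabilities, most common first."""
    total = sum(histogram.values())
    return sorted((count / total for count in histogram.values()), reverse=True)


def success_rate_bits(probs, beta):
    """Bits of security against an attacker limited to beta guesses per account.

    Assumed definition: a uniform distribution over N values yields success
    rate beta / N, so the equivalent N is beta / (observed success rate).
    """
    success = sum(probs[:beta])
    return math.log2(beta / success)


def alpha_guesswork_bits(probs, alpha):
    """Bits of security against an attacker who wants to break a fraction alpha.

    Assumed definition: guess in order of decreasing probability until the
    cumulative probability reaches alpha, count the expected guesses per
    account, then convert to the size of an equivalent uniform distribution.
    """
    cumulative = 0.0
    weighted_rank = 0.0
    guesses = 0
    for rank, p in enumerate(probs, start=1):
        cumulative += p
        weighted_rank += p * rank
        guesses = rank
        if cumulative >= alpha - 1e-12:  # tolerance for floating-point sums
            break
    # Accounts that are never broken still cost the full number of guesses.
    expected = (1.0 - cumulative) * guesses + weighted_rank
    return math.log2(2.0 * expected / cumulative - 1.0) - math.log2(2.0 - cumulative)


if __name__ == "__main__":
    # Hypothetical toy histogram, for illustration only.
    toy = {"123456": 50, "password": 25, "qwerty": 15, "zx9!kQ": 5, "t7#Lm2": 5}
    probs = sorted_probs(toy)
    print("bits against 1 guess:", success_rate_bits(probs, 1))
    print("bits at alpha = 0.5:", alpha_guesswork_bits(probs, 0.5))
```

For a truly uniform distribution both functions return the base-2 logarithm of the number of possible values, which is what makes the results comparable to key lengths. Skewed distributions score lower, and the shortfall is largest for attackers who make few guesses.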

Original language: English (US)
Title of host publication: Proceedings - 2012 IEEE Symposium on Security and Privacy, S and P 2012
Pages: 538-552
Number of pages: 15
DOIs: https://doi.org/10.1109/SP.2012.49
State: Published - 2012
Event: 33rd IEEE Symposium on Security and Privacy, S and P 2012 - San Francisco, CA, United States
Duration: May 21 2012 to May 23 2012

Other

Other: 33rd IEEE Symposium on Security and Privacy, S and P 2012
Country: United States
City: San Francisco, CA
Period: 5/21/12 to 5/23/12

Fingerprint

Glossaries
Entropy
Sampling
Feedback

Keywords

  • authentication
  • computer security
  • data mining
  • information theory
  • statistics

ASJC Scopus subject areas

  • Safety, Risk, Reliability and Quality
  • Software
  • Computer Networks and Communications

Cite this

Bonneau, J. (2012). The science of guessing: Analyzing an anonymized corpus of 70 million passwords. In Proceedings - 2012 IEEE Symposium on Security and Privacy, S and P 2012 (pp. 538-552). [6234435] https://doi.org/10.1109/SP.2012.49

The science of guessing: Analyzing an anonymized corpus of 70 million passwords. / Bonneau, Joseph.

Proceedings - 2012 IEEE Symposium on Security and Privacy, S and P 2012. 2012. p. 538-552 6234435.

Bonneau, J 2012, The science of guessing: Analyzing an anonymized corpus of 70 million passwords. in Proceedings - 2012 IEEE Symposium on Security and Privacy, S and P 2012., 6234435, pp. 538-552, 33rd IEEE Symposium on Security and Privacy, S and P 2012, San Francisco, CA, United States, 5/21/12. https://doi.org/10.1109/SP.2012.49
Bonneau J. The science of guessing: Analyzing an anonymized corpus of 70 million passwords. In Proceedings - 2012 IEEE Symposium on Security and Privacy, S and P 2012. 2012. p. 538-552. 6234435 https://doi.org/10.1109/SP.2012.49
Bonneau, Joseph. / The science of guessing: Analyzing an anonymized corpus of 70 million passwords. Proceedings - 2012 IEEE Symposium on Security and Privacy, S and P 2012. 2012. pp. 538-552
@inproceedings{fda09ad6f0dd46438023c2fa0da5e6a9,
title = "The science of guessing: Analyzing an anonymized corpus of 70 million passwords",
abstract = "We report on the largest corpus of user-chosen passwords ever studied, consisting of anonymized password histograms representing almost 70 million Yahoo! users, mitigating privacy concerns while enabling analysis of dozens of subpopulations based on demographic factors and site usage characteristics. This large data set motivates a thorough statistical treatment of estimating guessing difficulty by sampling from a secret distribution. In place of previously used metrics such as Shannon entropy and guessing entropy, which cannot be estimated with any realistically sized sample, we develop partial guessing metrics including a new variant of guesswork parameterized by an attacker's desired success rate. Our new metric is comparatively easy to approximate and directly relevant for security engineering. By comparing password distributions with a uniform distribution which would provide equivalent security against different forms of guessing attack, we estimate that passwords provide fewer than 10 bits of security against an online, trawling attack, and only about 20 bits of security against an optimal offline dictionary attack. We find surprisingly little variation in guessing difficulty; every identifiable group of users generated a comparably weak password distribution. Security motivations such as the registration of a payment card have no greater impact than demographic factors such as age and nationality. Even proactive efforts to nudge users towards better password choices with graphical feedback make little difference. More surprisingly, even seemingly distant language communities choose the same weak passwords, and an attacker never gains more than a factor of 2 in efficiency by switching from the globally optimal dictionary to a population-specific list.",
keywords = "authentication, computer security, data mining, information theory, statistics",
author = "Joseph Bonneau",
year = "2012",
doi = "10.1109/SP.2012.49",
language = "English (US)",
isbn = "9780769546810",
pages = "538--552",
booktitle = "Proceedings - 2012 IEEE Symposium on Security and Privacy, S and P 2012",

}

TY - GEN

T1 - The science of guessing

T2 - Analyzing an anonymized corpus of 70 million passwords

AU - Bonneau, Joseph

PY - 2012

Y1 - 2012

N2 - We report on the largest corpus of user-chosen passwords ever studied, consisting of anonymized password histograms representing almost 70 million Yahoo! users, mitigating privacy concerns while enabling analysis of dozens of subpopulations based on demographic factors and site usage characteristics. This large data set motivates a thorough statistical treatment of estimating guessing difficulty by sampling from a secret distribution. In place of previously used metrics such as Shannon entropy and guessing entropy, which cannot be estimated with any realistically sized sample, we develop partial guessing metrics including a new variant of guesswork parameterized by an attacker's desired success rate. Our new metric is comparatively easy to approximate and directly relevant for security engineering. By comparing password distributions with a uniform distribution which would provide equivalent security against different forms of guessing attack, we estimate that passwords provide fewer than 10 bits of security against an online, trawling attack, and only about 20 bits of security against an optimal offline dictionary attack. We find surprisingly little variation in guessing difficulty; every identifiable group of users generated a comparably weak password distribution. Security motivations such as the registration of a payment card have no greater impact than demographic factors such as age and nationality. Even proactive efforts to nudge users towards better password choices with graphical feedback make little difference. More surprisingly, even seemingly distant language communities choose the same weak passwords, and an attacker never gains more than a factor of 2 in efficiency by switching from the globally optimal dictionary to a population-specific list.

KW - authentication

KW - computer security

KW - data mining

KW - information theory

KW - statistics

UR - http://www.scopus.com/inward/record.url?scp=84878356177&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84878356177&partnerID=8YFLogxK

U2 - 10.1109/SP.2012.49

DO - 10.1109/SP.2012.49

M3 - Conference contribution

AN - SCOPUS:84878356177

SN - 9780769546810

SP - 538

EP - 552

BT - Proceedings - 2012 IEEE Symposium on Security and Privacy, S and P 2012

ER -