
By: maggiee
September 1, 2016
The Guessability of Passwords

Recently, over a family dinner, my aunt asked me how she could choose passwords that are secure. I responded with the usual advice: no words, especially not names; use a long passphrase, length really does matter; and so on. Until yesterday, though, I was unfamiliar with a formal metric for password “guessability”. In the course of my research I happened to stumble across a fascinating Google study on account recovery through security questions, and more broadly the work of Joseph Bonneau, a postdoctoral scholar in the Applied Cryptography group at Stanford who wrote his dissertation, “Guessing human-chosen secrets,” on password authentication. Several findings are extremely interesting, and have strong implications for security.
The guessing of a password is modeled as a random draw from some password distribution. If you are familiar with Shannon entropy (if not, check out my guide here), you might consider it a useful measure of uncertainty, but it actually measures a slightly different quantity (the average number of subset queries needed to identify the password in question, as opposed to the likelihood of guessing the correct password), and it does not correspond directly to how hard a password is to guess. Another idea is guessing entropy, the expected number of guesses until the correct password is found, which is closer to the metric we seek but is disproportionately affected by a few users with very strong passwords (128-bit pseudorandom hexadecimal strings). In Bonneau's research, 20 users who chose such passwords, out of a sample of nearly 70 million Yahoo! users, drove the guessing entropy up to 2^106, which is clearly not representative of most of the dataset. Instead, Bonneau describes and uses several “partial guessing metrics”: β-success-rate, the expected success rate for an attacker limited to β guesses per account; α-work-factor, the number of guesses per account needed to break a proportion α of accounts; and, combining the previous two, α-guesswork, the expected number of guesses per account needed to achieve a success rate of α. I will omit the mathematical details here and merely present what I find to be the most salient results.
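For readers who do want to see how the partial guessing metrics behave, here is a minimal Python sketch. It follows the definitions in Bonneau's work as I understand them: with password probabilities sorted from most to least common, the β-success-rate is the total probability of the β most common passwords, the α-work-factor is the smallest number of guesses whose combined probability reaches α, and the α-guesswork is the expected number of guesses per account for an attacker who gives up after that many tries. The function names and the ten-password sample are made up for illustration; none of this is Bonneau's code or data.

```python
from collections import Counter


def sorted_probs(passwords):
    """Empirical password probabilities, most common first."""
    counts = Counter(passwords)
    total = sum(counts.values())
    return sorted((c / total for c in counts.values()), reverse=True)


def success_rate(probs, beta):
    """beta-success-rate: share of accounts broken with beta guesses each."""
    return sum(probs[:beta])


def work_factor(probs, alpha):
    """alpha-work-factor: fewest guesses per account that break a share alpha."""
    cumulative = 0.0
    for guesses, p in enumerate(probs, start=1):
        cumulative += p
        if cumulative >= alpha - 1e-12:  # small tolerance for float rounding
            return guesses
    return len(probs)  # alpha is out of reach; every password was tried


def guesswork(probs, alpha):
    """alpha-guesswork: expected guesses per account to reach success rate alpha."""
    mu = work_factor(probs, alpha)
    lam = success_rate(probs, mu)
    # Accounts that fall are charged the rank of their password;
    # accounts that survive cost the full mu guesses.
    broken = sum(rank * p for rank, p in enumerate(probs[:mu], start=1))
    return (1 - lam) * mu + broken


# Hypothetical toy sample: ten accounts, four distinct passwords.
sample = ["123456"] * 5 + ["password"] * 3 + ["qwerty"] + ["x9$Lq"]
probs = sorted_probs(sample)

print(round(success_rate(probs, 1), 3))  # 0.5
print(work_factor(probs, 0.6))           # 2
print(round(guesswork(probs, 0.8), 3))   # 1.5
```

In the toy sample a single guess ("123456") breaks half the accounts, and breaking 80% costs only 1.5 guesses per account on average. The one random-looking password barely moves these numbers, which is the point of using partial metrics instead of guessing entropy.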
From the Yahoo! corpus, Bonneau looked at relative password strength across the population and several sub-populations. “There is a general trend towards better password selection with users’ age,” he writes, but age did not have nearly the effect of another factor: language. German- and Korean-speaking users had the strongest passwords, while Indonesian-speaking users had the weakest in general. Users who actively change their passwords, who log in from multiple locations, and who store a lot of data with Yahoo! all selected stronger passwords. Users who had their accounts compromised did not choose significantly stronger passwords after a manual reset, nor did those who enrolled with a form that showed a “graphical indicator of password strength,” as compared to those who enrolled with a form without password guidance or a minimum length requirement.
In the Google study, nearly 20% of English-speaking users’ answers to “favorite food” were guessable with a single try. Naturally, the answer distributions themselves are not included in the literature, but I’m willing to bet “pizza” was a pretty successful attempt for that question. Security questions become less secure as the answer space shrinks; in fact, roughly 40% of questions used in practice have “trivially small” answer spaces, meaning a limited number of potential answers. Even more problematic are strategies like United's new system, where the user must choose from a discrete list of options. Outside of those cases, much of the information requested is widely available through social media profiles. For example, it is easy to imagine discovering someone’s mother’s maiden name by scrolling through their Facebook friends. Even when we consider untargeted guessing on a large scale, adversaries can often achieve a high rate of success if the true distribution is readily available. Responses to “best friend’s name” and “first teacher’s name,” for example, occur with frequencies similar to those of first names and surnames in the population. Furthermore, some questions are less secure in a particular cultural context. The authors noted that they were able to correctly answer “place of birth” for 12% of Korean-speaking users within one try (as compared to 1.3% of English-speaking users) and almost 90% within 1000 tries (as compared to about 60% for English-speaking users). They attributed this phenomenon to the fact that the Korean population is highly concentrated in cities. The vendors performing the authentication might not take these factors into account, but persistent attackers certainly will.
In what the authors described as their most surprising observation, even for questions where everyone should have a unique answer, such as “frequent flyer number” and “phone number,” the answer distribution was not uniform. Some 4.2% of users claimed to have the same frequent-flyer number. Now, either the airlines have failed to notice a rather large mistake on their part, or people are lying. Users who copped to being dishonest gave a number of reasons, most commonly to make their response harder to guess or easier to remember. Unfortunately, untruthful answers actually did neither. The people who altered their responses tended to do so in the same way (e.g. a frequent-flyer number of “123456”), which made their answers less secure, and they also had a harder time remembering them. Just a few days after entering a fake answer, users who had been dishonest had trouble figuring out what false response they might have given.
It is an axiom among security experts that the weakest part of any system is the human component, and human-chosen passwords are no exception. And although some alternatives are favored over security questions, such as SMS and email recovery, it seems that security questions will not be totally removed from authentication for quite some time, so it's important to understand the limitations of these approaches.