GeodSoft

Good and Bad Passwords How-To

Review of the Conclusions and Dictionaries Used in a Password Cracking Study
Password Research

How much do we know about how users create passwords? There is a lot of anecdotal evidence but, until relatively recently, not much quantitative evidence regarding real user passwords. The post-2000 lists of common passwords are often quantitative, but they are not academic studies. The sources and methodologies behind the collected passwords may not be representative of what one would find in Windows and Unix OS user accounts, as opposed to the web accounts that seem to be the source of most revealed passwords. Various lists of the top n passwords show clear trends, but specific passwords change position, sometimes significantly, from list to list. The largest list, the top 10,000, combines all similar mixed case passwords into one all lower case password.

In the "standard paper on UNIX password security"2, titled "Password Security: A Case History"7, Robert Morris and Ken Thompson describe the characteristics of most of the 3289 passwords they had collected over a period of time. 551 were one to three characters, 477 were four letters, 706 were five single case characters and 605 were six lower case characters. "An additional 492 passwords appeared in" various dictionaries and lists. They also said, "There was, of course, considerable overlap between the dictionary results and the character string searches. The dictionary search ... produced about one third of the passwords." They did not describe the remaining 14%. This was 1979, and the first line starts with "Password security on the UNIX ... time-sharing system." At the time, and for more than a decade afterward, UNIX passwords were limited to 8 characters; anything longer was simply truncated.

The way I read this, the additional 492 passwords were seven or eight character words, or reversed words, or words with the first letter capitalized. These were the specific transformations that were applied to the dictionary words. About 450 (one third less 492) dictionary words appeared in the one through six character groups.
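
The arithmetic behind these figures can be checked with a short sketch. The counts are the ones quoted above; the group labels are only descriptive. The "about 450" overlap figure matches one third of the 2831 described passwords rather than one third of all 3289, so both readings are computed here. Which reading was intended is an assumption.

```python
# Counts quoted above from Morris and Thompson's 1979 paper.
TOTAL = 3289
groups = {
    "1-3 characters": 551,
    "4 letters": 477,
    "5 single-case characters": 706,
    "6 lower-case characters": 605,
    "dictionary only": 492,
}

described = sum(groups.values())
share = described / TOTAL

print(f"described: {described} of {TOTAL} ({share:.1%})")
print(f"not described: {TOTAL - described} ({1 - share:.1%})")

# Two possible readings of "about one third of the passwords".
third_of_all = TOTAL / 3 - groups["dictionary only"]
third_of_described = described / 3 - groups["dictionary only"]
print(f"overlap if one third of all {TOTAL}: {third_of_all:.0f}")
print(f"overlap if one third of the described {described}: {third_of_described:.0f}")
```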

In 1991, Daniel V. Klein did the only comprehensive password analysis I've seen. The results were published as "Foiling the Cracker: A Survey of, and Improvements to, Password Security"4. He obtained a database of 13,797 user accounts from a variety of sources, and successfully cracked 3340, or 24%, of them. The most computationally efficient approach was trying up to 130 variations of the account name, user name and other personal information taken directly from the passwd file. This yielded 368 passwords, or 2.7%, at a cost/benefit ratio of 2.83.

Several name lists were used as dictionaries. In aggregate they provided 1043 passwords, or 7.6%. The cost/benefit ratio varied dramatically, with "common names" being the most productive. The best single source was the dictionary provided in /usr/dict/words. This yielded 1027 passwords, or 7.4%. The list "Phrases and Patterns" got 253, or 1.8%, with a very good efficiency. This included a somewhat diverse collection compiled by Daniel Klein and others. Examples are 123abc, 4.2bsd, "get lost", gotohell, ibmpc, itty-bitty, xyz.

"Machine Names" also found a significant number, 132 or 1%, but with a very low efficiency. This list was created from an /etc/hosts file. It's worth noting that it has a significant number of ordinary words and names in it, as well as hundreds of demo9999 names. It would be interesting to know how many should have been in another dictionary. Several of the lists were compiled by Dan Klein and associates. Some are surprisingly small. The "Movies and Actors" list is very small (118 entries, nearly always one word per line) and eclectic, but resulted in 12 passwords. One can only speculate, but it seems very likely that larger, more comprehensive, and higher quality lists would yield significantly more found passwords, perhaps at a poorer cost/benefit ratio.

The words from each list were manipulated using 14 to 17 methods similar to those described in the previous "don't" list. Additional capitalization variations were also performed.
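
To make the idea of per-word manipulation concrete, here is a minimal sketch. The specific transformations (capitalizing, upper-casing, reversing, a naive plural, and a few letter-to-digit swaps) are illustrative assumptions, not a reproduction of the 14 to 17 methods used in the study. The function name `variants` and the `max_len` parameter are hypothetical. The 8 character cut-off reflects the UNIX password limit mentioned earlier.

```python
def variants(word, max_len=8):
    """Return candidate passwords derived from one dictionary word."""
    base = word.lower()
    # Assumed letter-to-digit substitutions, chosen only for illustration.
    swaps = str.maketrans({"o": "0", "l": "1", "s": "5"})
    candidates = [
        base,
        base.capitalize(),
        base.upper(),
        base[::-1],               # reversed word
        base[::-1].capitalize(),  # reversed, first letter capitalized
        base + "s",               # naive plural
        base.translate(swaps),
    ]
    seen = set()
    result = []
    for cand in candidates:
        cand = cand[:max_len]     # old UNIX passwords were truncated to 8 characters
        if cand and cand not in seen:
            seen.add(cand)
            result.append(cand)
    return result

print(variants("password"))
```

Note that truncation makes some variations collapse into each other: the plural of an eight letter word is indistinguishable from the word itself, so it is dropped as a duplicate.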

An Analysis of Daniel Klein's Dictionaries

It's worth looking at the dictionaries used by Daniel Klein in some detail. There were two general word dictionaries. One was /usr/dict/words, the standard UNIX dictionary used for spell checking. This was a small general purpose dictionary. Thus, most of the words in it will be common, compared to some of the words found in a collegiate dictionary, or most in an unabridged dictionary. There were also 3212 miscellaneous words from the "junk" dictionary that did not appear in the other dictionaries. Some of these were more obscure words, but others are character sequences that do not appear to be words from any language I've ever seen; the comment in the file admits this list contains many junk words.

The 19,683-word standard dictionary led to 1027 passwords at a cost/benefit ratio of 0.052; the miscellaneous words resulted in 54 passwords at a cost/benefit ratio of 0.017. The results support three conclusions that are consistent with common sense and with observation of common password lists. Many people use ordinary words as the basis for their passwords. Of these, most choose words that quickly come to mind, i.e., common words. A smaller group tries to find "obscure" words on which to base their passwords. It would be interesting to know what results an unabridged dictionary would have produced if used against the same account and password database.

In Daniel Klein's paper, "Common names" were identified as the second most productive dictionary, with 2239 names yielding 548 passwords at a 0.245 cost/benefit ratio. This was the fourth best cost/benefit ratio of 27 dictionaries used. It's by far the largest of the "high yield" dictionaries, but still less than one eighth the size of the /usr/dict/words list, which was the only single list to yield more passwords. In short, common first names are, by a significant amount, the most frequent basis for passwords.
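
The cost/benefit figures quoted so far are all consistent with dividing the number of passwords found by the number of entries tried. That definition is inferred here from the numbers themselves, not quoted from the paper. A short sketch using only figures from this page:

```python
ACCOUNTS = 13797  # accounts in the password database

# (entries tried, passwords found), as quoted in the text
results = {
    "User/account name": (130, 368),
    "/usr/dict/words": (19683, 1027),
    "Miscellaneous words": (3212, 54),
    "Common names": (2239, 548),
}

def cost_benefit(tried, found):
    """Passwords found per entry tried (assumed definition)."""
    return found / tried

for name, (tried, found) in results.items():
    ratio = cost_benefit(tried, found)
    share = found / ACCOUNTS
    print(f"{name:20} ratio={ratio:.3f}  share of accounts={share:.1%}")
```

The user/account name figure stands apart because its 130 "entries" are variations generated per account, which is why its ratio exceeds 1.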

The actual contents of this dictionary are very interesting. There is a comment at the beginning of the dictionary: "First names garnered from a number of password files. We get a good hit rate from these. Probably could be culled somewhat. By Daniel Klein." Reviewing these, though the list certainly contains many, perhaps mostly common names, I see a significant number of names I don't recognize. The "could be culled" comment supports this.

The Census Bureau created three exceptionally high quality common name lists: one for female first names, one for male first names and one for last names. Each list is ordered by how frequently the name was used within the U.S. population in 1990. Each list includes as many names as necessary so that 90% of the U.S. population has their name listed. There are 1219 male names, 4275 female names and 88,799 last names. 3.3% of the men in the U.S. are named James and nearly as many are named John. 2.6% of the women are named Mary, which is more than two and a half times as many as Patricia, the second most common female name. 1% of the population is named Smith and 0.8% Johnson. The Census Bureau produced a new last name list from the 2000 census, but no first name lists.

Many of Daniel Klein's "Common names" do not appear in any Census Bureau common name list. It's unlikely that the odd names in the common names list got good results, if any. He stated that he requested password lists from "around the United States and Great Britain." It's not likely that his password list had a strong local bias, in the way that a single Unix system in one of the four states bordering Mexico might have a higher than average number of Hispanic names. Though they were based on the 1990 census, the Census name lists were not created until 1996, so there is no way that Daniel Klein could have used them. Similar lists from other countries would most likely have a very high return in the country of origin.

Actually, it's fair to say many of Daniel Klein's common names are not at all common. The Census Bureau lists each cover 90% of the U.S. population (in 1990). His common name list had 2273 names in it, but only 676 were in the Census female names list and 543 in the male names list, and 211 were common to both lists, so only 1008 unique names out of 2273 matched a Census list. It is very surprising to study the Census lists and see how many women have men's names and vice versa. I just obtained (Jan. 30, 2014) the Social Security Administration's lists of baby names from 1880 through 2012. They include every name which was used for each sex five or more times in each year. These contained 63,246 girls' names and 38,015 boys' names. I dropped the less used names and used 29,952 girls' names and 19,685 boys' names. I had no idea how these would match up with Daniel Klein's "Common names." They matched 1036 girls' names and 1267 boys' names, but 845 were in both lists, so only 1458 were unique. 815 were unmatched even when compared against almost 50,000 statistically common to fairly infrequent names (used 50 or more times in 133 years).
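
The unique match counts above follow from simple inclusion-exclusion: names matched in the female list plus names matched in the male list, minus names matched in both. The sketch below shows one way such a comparison could be run with sets, then checks the reported figures. The function names and the tiny name lists are invented for illustration, and the case-insensitive comparison is an assumption about how the lists would be matched.

```python
def match_counts(candidates, list_a, list_b):
    """Count candidates found in list_a, in list_b, in both, and in either."""
    cands = {name.lower() for name in candidates}
    in_a = cands & {name.lower() for name in list_a}
    in_b = cands & {name.lower() for name in list_b}
    return len(in_a), len(in_b), len(in_a & in_b), len(in_a | in_b)

# Toy data, invented for illustration only.
common = ["Mary", "James", "Leslie", "Zaphod"]
female = ["MARY", "LESLIE", "PATRICIA"]
male = ["JAMES", "JOHN", "LESLIE"]
print(match_counts(common, female, male))

def unique_matches(in_a, in_b, in_both):
    """Inclusion-exclusion on already counted matches."""
    return in_a + in_b - in_both

# Figures reported in the text.
census_unique = unique_matches(676, 543, 211)
ssa_unique = unique_matches(1036, 1267, 845)
print(census_unique, ssa_unique, 2273 - ssa_unique)
```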

I have little doubt that if the top 800 male and 1500 female names were taken from either the Census or SSA lists and run against Daniel Klein's password list that either would get significantly better results than his common name list. If the full Census lists or my 50,000 name selection from the SSA lists were used, I'm sure many more passwords would have been found than the 548 found by the "Common name" list. Even with over 800 quite unusual names, this list still had the fourth highest "Cost/Benefit Ratio." Using the full Census list I would not be surprised if the efficiency was lower. Using 50,000 names from the SSA lists, the efficiency would surely be much lower.

Daniel Klein's 62,727-word dictionary was a good first step in building a password cracking dictionary. Today, larger, more comprehensive, and more consistent quality lists are available. Some of the specific dictionaries that he created should be considered for inclusion in any cracking dictionary that is to be built. The character sequences and number sequences contained in the similarly named dictionaries probably belong in any cracking dictionary, perhaps in expanded form.

Dictionary List

Daniel Klein's dictionaries do not seem to be available at their original or previous location but are now available here. Matching the physical file names to the "Types of Password" which Daniel Klein used was not always obvious. You need to be a Unix administrator to realize that "Machine names" and etc-hosts are a match. After matching the obvious file names to password type, I used counts to help match the remaining five. The only exact match on counts was Mnemonics and abbr, and I have no idea what the file has to do with either. The counts in the rest were within 10, usually 5, and I did check all those I thought were obvious, to be sure I had not made a mistake. The table below shows the password type matched to the file name. The counts are from Daniel Klein's paper, not the file counts.

Type of Password      Count   Dictionary file
User/account name       130   not available
Character sequence      866   chars
Numbers                 450   numbers
Chinese                 398   chinese
Place names             665   places
Common names           2268   first-names
Female names           4955   female-names
Male names             3901   male-names
Uncommon names         5559   other-names
Myths & legends        1357   myths-legends
Shakespearean           650   shakespeare
Sports terms            247   sports
Science fiction         772   sf
Movies and actors       118   movies
Cartoons                133   cartoon
Famous people           509   famous
Phrases and patterns    998   phrases
Surnames                160   surnames
Biology                  59   biology
/usr/dict/words       24474   not available
Machine names         12983   etc-hosts
Mnemonics                14   abbr
King James bible      13062   kjbible
Miscellaneous words    8146   junk
Yiddish words            69   yiddish
Asteroids              3459   asteroids

The dictionaries were used in the order listed, and presumably duplicates were removed in that order also. Unfortunately the two most important files are missing. The data used for user and account names was taken directly from the password file, and was by far the most efficient in finding passwords. The 130 number is somewhat misleading as it applies to the "User/account name". Depending on the data available, up to 130 variations / combinations were made on each account. Because the data is specific to each of the 13,797 accounts in the password file, it's highly unlikely that it was used to remove any duplicates from files used after it.
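
Order-dependent duplicate removal of the kind presumed above can be sketched as follows: each word is kept only in the first dictionary where it appears, so the effective size of a later dictionary depends on everything used before it. The function name, the toy word lists, and the case-insensitive comparison are assumptions made for illustration; the paper's actual procedure is not described on this page.

```python
def dedupe_in_order(dictionaries):
    """dictionaries: list of (name, words) pairs in the order they are used.

    Returns the same pairs with any word already seen in an earlier
    dictionary (or earlier in the same one) removed.
    """
    seen = set()
    result = []
    for name, words in dictionaries:
        kept = []
        for word in words:
            key = word.lower()  # assumed: duplicates compared without regard to case
            if key not in seen:
                seen.add(key)
                kept.append(word)
        result.append((name, kept))
    return result

# Toy data, invented for illustration only.
toy = [
    ("numbers", ["123", "1234"]),
    ("first-names", ["mary", "rose", "mark"]),
    ("words", ["rose", "mark", "table", "Mary"]),
]
for name, kept in dedupe_in_order(toy):
    print(name, len(kept), kept)
```

In the toy run, the last list shrinks from four entries to one because three of its words were already claimed by an earlier list. This is the same effect that makes the missing /usr/dict/words file matter for every dictionary that followed it.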

/usr/dict/words (or /usr/share/dict/words) is a standard file on most Unix and Unix-like systems, but current files are of little value here; the ones on my Linux systems are nearly 20 times as large. These files typically grow with each release of each Unix variant. Without knowing, and having access to, the specific Unix version Daniel Klein was on nearly 23 years ago, there is no way to know what words were in it. Because it likely had a large influence on the removal of duplicates from all subsequent files, we can never know which words from those files were actually used. Refer to the paper to see how the Westernized Chinese syllables were used.


Copyright © 2000 - 2014 by George Shaffer. This material may be distributed only subject to the terms and conditions set forth in http://GeodSoft.com/terms.htm (or http://GeodSoft.com/cgi-bin/terms.pl). These terms are subject to change. Distribution is subject to the current terms, or at the choice of the distributor, those in an earlier, digitally signed electronic copy of http://GeodSoft.com/terms.htm (or cgi-bin/terms.pl) from the time of the distribution. Distribution of substantively modified versions of GeodSoft content is prohibited without the explicit written permission of George Shaffer. Distribution of the work or derivatives of the work, in whole or in part, for commercial purposes is prohibited unless prior written permission is obtained from George Shaffer. Distribution in accordance with these terms, for unrestricted and uncompensated public access, non profit, or internal company use is allowed.

 