Real-World Data Sets
Here is a list of real-world data sets collected from the web.
| File | Size | Description | Format | Source |
|---|---|---|---|---|
| dickens.txt | 30M | nearly the complete works of Charles Dickens | text file | Project Gutenberg | |||||
| magna-carta.txt | 78K | Magna Carta | text file | Project Gutenberg | |||||
| war+peace.txt | 3M | War and Peace | text file | Project Gutenberg | |||||
| chromosome4.txt | 10M | human chromosome 4 | text file | Project Gutenberg | |||||
| chromosome11.txt | 7M | human chromosome 11 | text file | Project Gutenberg | |||||
| ecoli.txt | 4M | E. coli genome | text file | Project Gutenberg | |||||
| world192.txt | 2M | World Factbook 1992 | text file | Project Gutenberg | |||||
| tale.txt | 779K | Tale of Two Cities | text file | Project Gutenberg | |||||
| TomSawyer.txt | 406K | Tom Sawyer | text file | Project Gutenberg | |||||
| bible.txt | 4M | The Bible | text file | ||||||
| mobydick.txt | 1M | Moby Dick | text file | ||||||
| aesop.txt | 187K | Aesop's Fables | text file | ||||||
| manifesto.txt | 71K | Communist Manifesto | text file | ||||||
| lilwomen.txt | 1018K | Little Women | text file | ||||||
| muchado.txt | 121K | Much Ado About Nothing | text file | ||||||
| amendments.txt | 18K | amendments to constitution | text file | ||||||
| bush-kerry1.txt | 82K | Bush-Kerry debate 1 | text file | debates.org | |||||
| bush-kerry2.txt | 92K | Bush-Kerry debate 2 | text file | debates.org | |||||
| bush-kerry3.txt | 85K | Bush-Kerry debate 3 | text file | debates.org | |||||
| obama-mccain1.txt | 90K | Obama-McCain debate 1 | text file | debates.org | |||||
| obama-mccain2.txt | 88K | Obama-McCain debate 2 | text file | debates.org | |||||
| obama-mccain3.txt | 85K | Obama-McCain debate 3 | text file | debates.org | |||||
| pi-10million.txt | 10M | first 10 million digits of pi | text file | ||||||
| pi-1million.txt | 977K | first 1 million digits of pi | text file | ||||||
| elements.csv | 5K | periodic table of elements | CSV | ||||||
| surnames.csv | 2M | 88,799 surnames from US Census | CSV | 1990 US Census | |||||
| ip-by-country.csv | 5M | IP address ranges by country | CSV | MaxMind | |||||
| dma.csv | 4K | designated market area code | CSV | MaxMind | |||||
| misspellings.csv | 46K | common misspellings | CSV | Wikipedia | |||||
| starbucks.csv | 619K | latitude and longitude of Starbucks | CSV | POI Factory | |||||
| wendys.csv | 538K | latitude and longitude of Wendys | CSV | POI Factory | |||||
| mcdonalds.csv | 1M | latitude and longitude of McDonalds | CSV | POI Factory | |||||
| burgerking.csv | 662K | latitude and longitude of Burger Kings | CSV | POI Factory | |||||
| walmart.csv | 468K | latitude and longitude of Walmarts | CSV | POI Factory | |||||
| homedepot.csv | 216K | latitude and longitude of Home Depots | CSV | POI Factory | |||||
| dairyqueen.csv | 459K | latitude and longitude of Dairy Queens | CSV | POI Factory | |||||
| pizzahut.csv | 551K | latitude and longitude of Pizza Huts | CSV | POI Factory | |||||
| zips1990.csv | 2M | latitude and longitude by zip code in 1990 | CSV | Tiger | |||||
| zips1990-full.csv | 2M | latitude and longitude by zip code in 1990 | CSV | Tiger | |||||
| zips2000.csv | 964K | latitude and longitude by zip code in 2000 | CSV | Tiger | |||||
| calories.csv | 43K | calories for various food items | CSV | ||||||
| DJIA.csv | 1M | Dow Jones Industrial Average | CSV | ||||||
| morse.csv | 242 | Morse code | CSV | ||||||
| amino.csv | 1K | amino acids | CSV | ||||||
| names.csv | 103K | names and their meanings | CSV | ||||||
| codes.csv | 820 | states and FIPS codes | CSV | Tiger | |||||
| phone-na.csv | 28K | North American telephone codes | CSV | ||||||
| phone-international.csv | 3K | international telephone codes | CSV | ||||||
| airports.csv | 5K | airport codes | CSV | ||||||
| psychiatric.csv | 18K | psychiatric disorders and DSM codes | CSV | allpsych.com | |||||
| fortune1000.csv | 23K | Fortune 1000 companies | CSV | ||||||
| language.csv | 196K | common words translated into 15 languages | CSV | ||||||
| synsets.txt | 8M | Wordnet synonym sets | CSV | WordNet | |||||
| hypernyms.txt | 952K | Wordnet hypernym relations | CSV | WordNet | |||||
| countries.csv | 7K | countries, capitals, and country codes | CSV | ubuntu.com | |||||
| iso3166.csv | 4K | ISO 3166 country codes | CSV | MaxMind | |||||
| fips10_4.csv | 73K | FIPS 10-4 subcountry codes | CSV | MaxMind | |||||
| bnc-wordfreq.csv | 122K | frequency of words in British National Corpus | CSV | Adam Kilgarriff | |||||
| upc-glns.csv | 5M | manufacturers by 13-digit GLN | CSV | upcdatabase.com | |||||
| upc-items.csv | 58M | items by UPC code | CSV | upcdatabase.com | |||||
| sdss174052.csv | 20M | 0.1% of Sloan Digital Sky galaxy objects | CSV | Sloan Digital Sky | |||||
| sdss1738478.csv | 201M | 1% of Sloan Digital Sky galaxy objects | CSV | Sloan Digital Sky | |||||
| sdss6949386.csv | 804M | 4% of Sloan Digital Sky galaxy objects | CSV | Sloan Digital Sky | |||||
| comets.csv | 3K | comets | CSV | Home Planet | |||||
| meteors.csv | 3K | meteor showers | CSV | Home Planet | |||||
| mktsymbols.txt | 4M | market symbols | tab-separated file | ||||||
| movies-hero.txt | 44K | movies with hero in the title | / delimited | IMDb | |||||
| movies-mpaa.txt | 6M | movies rated by the MPAA | / delimited | IMDb | |||||
| movies-top-grossing.txt | 152K | top-grossing movies | / delimited | IMDb | |||||
| contiguous-usa.dat | 642 | adjacencies between contiguous US states and DC | vertex pairs | Stanford GraphBase | |||||
| usa13509.txt | 319K | latitude and longitude of 13,509 cities in US | latitude, longitude pairs | TSPLIB | |||||
| leipzig/leipzig100k.txt | 12M | 100K random sentences | one sentence per line | Leipzig Corpora | |||||
| leipzig/leipzig300k.txt | 37M | 300K random sentences | one sentence per line | Leipzig Corpora | |||||
| leipzig/leipzig1m.txt | 124M | 1 million random sentences | one sentence per line | Leipzig Corpora | |||||
| words.txt | 164K | 20,068 words | one word per line | ||||||
| wordlist.txt | 2M | 224,714 words | one word per line | ||||||
| words.utf-8.txt | 6M | 645,288 words | one word per line | ||||||
| words.shakespeare.txt | 228K | words in the complete works of Shakespeare | one word per line | ||||||
| ospd.txt | 600K | Official Scrabble Players Dictionary | one word per line | ||||||
| web2.txt | 2M | Webster's NI2 dictionary | one word per line | Webster | |||||
| 1000words.txt | 7K | 1000 most common words | one word per line | ||||||
| words5-knuth.txt | 34K | 5757 five-letter words | one word per line | Stanford GraphBase | |||||
| stopwords.txt | 3K | words ignored in Wikipedia search | one word per line | MySQL | |||||
| commonwords.txt | 784K | 74,202 common terms | one term per line | ||||||
| california-gov.txt | 2K | list of candidates for California governor | one candidate per line | ||||||
| fips55-all.txt | 29M | codes for named populated places | fixed-width fields | FIPS 55-3 | |||||
| fips55-pa.txt | 2M | codes for named populated places | fixed-width fields | FIPS 55-3 | |||||
| bostonmetro.txt | 5K | Boston Metro | fixed-width fields | MIT | |||||
| latlng.txt | 1M | latitude and longitude of 25,000+ places in US | fixed-width fields | Tiger |
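
Several of the formats in the table (CSV, tab-separated, and `/` delimited) differ only in the field separator, so one small reader can probably handle all three. The following is a minimal Python sketch under that assumption. The helper name `read_delimited` and the sample rows are hypothetical and invented for illustration; they are not taken from the actual files, whose column layouts are not described here.

```python
import csv
import io


def read_delimited(text, delimiter=","):
    """Split delimiter-separated text into a list of rows (lists of fields)."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    return [row for row in reader if row]  # drop blank lines


# Hypothetical sample rows, made up for this sketch (not real file contents).
csv_sample = "A,.-\nB,-...\n"
slash_sample = "Some Movie (1999)/Actor One/Actor Two\n"

morse_rows = read_delimited(csv_sample)                    # comma-separated
movie_rows = read_delimited(slash_sample, delimiter="/")   # "/" delimited

for row in morse_rows + movie_rows:
    print(row)
```

For a downloaded file, the same helper could be fed the file's contents, for example `read_delimited(open("morse.csv", encoding="utf-8").read())`, assuming the file is UTF-8 text. Passing `delimiter="\t"` would cover the tab-separated entry.
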
Internet movie database.
State boundaries by county.
Presidential election results.