Updated Fri Feb 14 07:20:54 EST 2025
Count the lines, words and characters.
$ wc dog.big.csv
616891 1316619 38715790 dog.big.csv
$ awk --csv '{ c+=length; w+=NF } END { print NR, w, c }' dog.big.csv
616891 4935128 38098807
Print just the dog names.
$ awk --csv '{print $1}' dog.big.csv
AnimalName
PAIGE
YOGI
ALI
QUEEN
...
Count the number of occurrences of each name,
and display the ten most common in increasing order of count.
$ awk --csv '{print $1}' dog.big.csv | sort | uniq -c | sort -n | tail
3576 LUCY
4029 ROCKY
4201 LOLA
4636 COCO
4951 CHARLIE
4985 LUNA
5774 MAX
6833 BELLA
7316 NAME NOT PROVIDED
14970 UNKNOWN
Find all dogs named after current and still-alive former US presidents.
Print these names in order of popularity.
$ awk --csv '
$1 ~ /^(TRUMP|CLINTON|BUSH|BIDEN|OBAMA)$/ { s[$1]++ }
END { for (i in s) print s[i], i }
' dog.big.csv | sort -nr
16 CLINTON
10 TRUMP
10 BUSH
2 OBAMA
1 BIDEN
Are there dogs who share a name with a significant other, past or present?
$ awk --csv '$1 == "MEG"' dog.big.csv
MEG,F,2014,Poodle,10017,05/29/2015,05/29/2016,2016
MEG,F,2015,"Poodle, Standard",11211,10/22/2015,10/22/2016,2016
MEG,F,2001,Unknown,10023,02/27/2016,02/28/2017,2016
MEG,F,2016,Australian Cattledog,10028,12/24/2016,12/24/2017,2016
MEG,F,2009,French Bulldog,11103,02/07/2017,02/07/2018,2017
...
Think of or look for or stumble into real or apparent errors in the dog data.
Identify at least half a dozen significant kinds of errors, with specific examples.
Gender???
$ awk --csv '$2 != "M" && $2 != "F"' dog.big.csv | wc
22 50 1413
Field count???
$ awk --csv '{print NF}' dog.big.csv | sort | uniq -c
616891 8
Date of birth???
$ awk --csv '{print $3}' dog.big.csv | sort | uniq -c | sort -n
1 1912
1 1920
1 1921
1 1922
1 1923
...
Zipcode???
$ awk --csv '{print $5}' dog.big.csv | sort -u -n
ZipCode
100
121
687
923
1001
...
98363
98433
99202
99508
Capitalization? Spelling?
$ awk --csv '{print $1}' dog.big.csv | grep -i unknown | sort | uniq -c
14970 UNKNOWN
1 UNKNOWNALA
1 UNKNOWNCOCK
2 UNKNOWNMOR
1 UNKNOWNMS
1 UNKNOWNN
2 UNKNOWNNMINI
1 UNKNOWNSHIH
20 Unknown
What can go wrong?
We already backed into one example: dogs with unknown or unspecified names.
Think of some consistency properties that this
data should have.
Are there conservation laws, where data items of a certain type
should always add up to some value?
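One consistency property worth testing is that the dates in a record agree with each other. The sketch below is a guess at how that might look, not a definitive check. It assumes that field 3 is the birth year and that fields 6 and 7 are the license issue and expiration dates in MM/DD/YYYY form, which the MEG records above suggest but do not prove. The function name check_dates is made up for this example, and it needs an awk that supports --csv, like the commands above.

```shell
# check_dates FILE: print records whose dates look inconsistent.
# Assumptions (not confirmed by the data): $3 is the birth year, $6 is the
# issue date and $7 the expiration date, both written MM/DD/YYYY.
check_dates() {
    awk --csv '
    # Convert MM/DD/YYYY to the sortable number YYYYMMDD; -1 if malformed.
    function ymd(d,   p) {
        if (split(d, p, "/") != 3) return -1
        return p[3] * 10000 + p[1] * 100 + p[2]
    }
    NR == 1 { next }                         # skip the header line
    {
        issued = ymd($6); expires = ymd($7)
        if (issued < 0 || expires < 0)
            print "bad date:", $0
        else if (expires <= issued)
            print "expires before issued:", $0
        else if ($3 + 0 > int(issued / 10000))
            print "born after issued:", $0
    }' "$1"
}
```

To try it: check_dates dog.big.csv | sed 's/:.*//' | sort | uniq -c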
Is the data in the right format?
Remember Wouter's comments on character sets?
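A quick way to start on the format question is to look for bytes that are not printable ASCII, which would reveal accented letters, stray control characters, or carriage returns. This is only a sketch: the function name non_ascii_lines is made up, and setting LC_ALL=C is meant to make awk compare single bytes rather than multi-byte characters.

```shell
# non_ascii_lines FILE: print "line number: line" for every line that
# contains a byte outside the printable ASCII range (space through tilde).
non_ascii_lines() {
    LC_ALL=C awk '/[^ -~]/ { print NR ": " $0 }' "$1"
}
```

To try it: non_ascii_lines dog.big.csv | wc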
All data has errors. Any computation that blindly relies on its data being correct is doomed. You must always check your data for validity.
Checks like these are the tip of an iceberg of possibilities. There are plenty of other weirdnesses that should be understood before drawing any serious conclusions from the data. The dog names are not sorted, and in fact it's not clear what the sort order is.
What is the most likely sort order? Can you validate that or identify exceptions to the sort order?
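The MEG records above happen to appear in increasing order of field 6, so one hypothesis to test is that the file is ordered by that field. The sketch below assumes field 6 is an issue date in MM/DD/YYYY form and prints every record whose date is earlier than that of the record before it. The function name out_of_order is made up for this example; if the output is large, the hypothesis is probably wrong.

```shell
# out_of_order FILE: print "line number: record" for each record whose
# field 6 date is earlier than that of the preceding record.
# Assumption (not confirmed by the data): $6 is written MM/DD/YYYY.
out_of_order() {
    awk --csv '
    NR == 1 { next }                             # skip the header line
    {
        split($6, p, "/")
        key = p[3] * 10000 + p[1] * 100 + p[2]   # YYYYMMDD as a number
        if (NR > 2 && key < prev) print NR ": " $0
        prev = key
    }' "$1"
}
```

To try it: out_of_order dog.big.csv | wc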
There are also many duplicates that arise from annual registration; it looks like one could identify the same dog by the combination of name, gender, breed, birth year and a sequence of registrations in chronological order, though this is starting to get a bit complicated.
Invent an algorithm that will identify and compress records that appear to be about the same dog. How many unique dogs are there?
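A crude first approximation, offered as a starting point rather than an answer, is to treat all records that agree on name, gender, birth year and breed as the same dog. This ignores the sequence of registrations entirely, so it merges distinct dogs that happen to share all four values and splits one dog whose breed or name is spelled differently from year to year. The function name count_dogs is made up for this example.

```shell
# count_dogs FILE: print the number of distinct (name, gender, birth year,
# breed) combinations, followed by the number of data records.
count_dogs() {
    awk --csv '
    NR == 1 { next }                         # skip the header line
    {
        if (!(($1, $2, $3, $4) in seen)) {   # first time this key appears
            seen[$1, $2, $3, $4] = 1
            dogs++
        }
        records++
    }
    END { print dogs + 0, records + 0 }' "$1"
}
```

To try it: count_dogs dog.big.csv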
There are a fair number of records that are absolutely identical.
Find and print duplicated records.
$ sort dog.big.csv | uniq -c | awk '$1 > 1' | wc
39646 128264 2721787
What kinds of dogs are there?
$ awk --csv '{x[$4]++}; END {for (i in x) print x[i], i}' dog.big.csv | sort -nr
55254 Unknown
35363 Yorkshire Terrier
32432 Shih Tzu
24634 Chihuahua
18902 Labrador Retriever
...
1 mutt from shelter. looks like border collie
...
1 AM PIT BULL MIX
1 AM ESKIMO / MALTESE
1 ALASKAN KLEE KAI
1 AKITA MIX
1 AFFENPINSCHER
1 6152012
Flaky and/or mysterious data is typical of most real-world data sets. It's valuable to have a skeptical mindset and a collection of tools for checking and verifying before drawing too many conclusions.
Come to class prepared to talk about what you did and what more might be done.