Updated Sat Feb 8 10:48:36 EST 2025
In class we will talk about AWK, a simple programming language that works beautifully with line-oriented data. You can learn enough AWK in a few minutes to do useful computing, and an hour or two of practice will make you almost an expert. Few other languages will give you so much bang for the buck.
But for now, you can do these tasks in any way you want, though a mix of Python and standard Unix tools will likely be easiest and most effective.
Line-at-a-time data as seen in spreadsheets is a natural for Unix tools, many of which were processing lines decades before spreadsheets were invented. Examples include grep, which searches lines for patterns; wc, which counts lines, words, and characters; diff, which compares files line by line; and sort, which sorts lines into order.
For this exercise, the data set is a much bigger and richer version of the dog names that we played with in the warmup exercise. It comes from the New York City Department of Health and Mental Hygiene web site:
https://data.cityofnewyork.us/Health/NYC-Dog-Licensing-Dataset/nu7n-tubp/about_data
You can download the latest version as a .xls file, which you will then have to convert to a .csv. In theory it's updated annually; the most recent version is from February 2024.
Download the .xls file.
Before you can do much useful with it, you will have to convert it to a better format: .csv. Excel or Google Sheets will do this for you.
Convert the .xls data to a .csv file.
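If you'd rather script the conversion than click through a spreadsheet, the pandas library can do it. Here is a minimal sketch, assuming the download is saved as dogs.xls (reading old-style .xls files also requires the xlrd package); the filenames are hypothetical:

    import pandas as pd

    # Read the spreadsheet and write it back out as CSV.
    # "dogs.xls" and "dogs.csv" are hypothetical filenames.
    pd.read_excel("dogs.xls").to_csv("dogs.csv", index=False)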
CSV stands for comma-separated values. It's a more or less standard format that is read and written by Excel and Google Sheets, and by libraries in lots of other tools and programming languages. A CSV file has a number of rows; each row has the same number of fields; any given field represents information of a particular type for all rows. A lot of digital humanities data is stored in CSV files, so it's highly likely that you'll wind up having to handle such things.
In one very common convention, the first row names the fields or columns, and the real data starts in the second row. This data has eight fields, all of which suggest interesting avenues to explore. You might take a moment right now to think about what questions to ask.
In CSV, fields are separated by commas (which, not surprisingly, relates to the name). A field that contains a comma must be quoted, fields may be quoted even if they don't contain commas, and a double quote within a field is represented by a doubled double-quote character, as illustrated here:
    field1,,field3,                       fields 2 and 4 are empty
    field1,"field2, etc","field3",field4  field 2 contains a comma
    "a double quote "" looks like this"   field 1 contains a double quote
    """"                                  field 1 is a single double-quote character
    ""","""                               field 1 is ","
CSV format is not rigorously defined, so there are some ambiguities with line breaks inside fields, erroneous input, and so on. In spite of such minor problems, CSV is widely used.
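If you use Python, the standard csv module already implements these quoting rules, so there is rarely a reason to parse CSV by hand. A quick demonstration on the examples above:

    import csv
    import io

    # The examples above, exactly as they would appear in a file.
    sample = '''field1,,field3,
    field1,"field2, etc","field3",field4
    "a double quote "" looks like this"
    """"
    ""","""
    '''

    # csv.reader applies the quoting rules automatically; each printed
    # row is a list of the decoded field values.
    for row in csv.reader(io.StringIO(sample)):
        print(row)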
Count the lines, words and characters. Print just the dog names. Count the number of occurrences of each name, and display them in decreasing numerical order.
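In Python, a sketch like this covers all three steps; the filename dogs.csv and the column name AnimalName are assumptions, so check them against your own download:

    import csv
    from collections import Counter

    with open("dogs.csv", newline="", encoding="utf-8") as f:
        text = f.read()
    # lines, words, and characters -- close to what wc prints, though
    # wc counts bytes rather than characters by default
    print(len(text.splitlines()), len(text.split()), len(text))

    with open("dogs.csv", newline="", encoding="utf-8") as f:
        counts = Counter(row["AnimalName"] for row in csv.DictReader(f))
    # every distinct name with its count, most common first
    for name, n in counts.most_common():
        print(n, name)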
Do you notice anything unusual about the most common names? The top two "names," comprising about 3 percent of the whole, are either not provided or unknown. This is a nice illustration of some important facts about real data. First, it is often incomplete. Second, there are often inconsistencies. Is there a meaningful difference between "name not provided" and "unknown"? How did those two apparently identical categories arise? Might there be other categories of unspecified names? What other anomalies might we discover?
Grep is useful for simple queries, as we saw before.
Find all dogs named after current and still-alive former US presidents. Print these names in order of popularity. Are there dogs who share a name with a significant other, past or present?
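One possible starting point in Python, with a deliberately incomplete candidate list; assembling the right list is part of the exercise, and normalizing case before comparing is a safe precaution:

    import csv
    from collections import Counter

    # Illustrative and incomplete -- build the real list yourself.
    candidates = ["BIDEN", "OBAMA", "CLINTON", "CARTER"]

    with open("dogs.csv", newline="", encoding="utf-8") as f:
        counts = Counter(row["AnimalName"].upper() for row in csv.DictReader(f))

    # print the candidates in decreasing order of popularity
    for name in sorted(candidates, key=lambda n: counts[n], reverse=True):
        print(counts[name], name)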
Think of, look for, or stumble into real or apparent errors in the dog data. Identify at least half a dozen significant kinds of errors, with specific examples.
What can go wrong? We already backed into one example: dogs with unknown or unspecified names. Think of some consistency properties that this data should have. Are there conservation laws, where data items of a certain type should always add up to some value? Is the data in the right format? Remember Wouter's comments on character sets?
All data has errors. Any computation that blindly relies on its data being correct is doomed. You must always check your data for validity.
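For instance, a couple of mechanical sanity checks might look like this sketch; the column names AnimalBirthYear and ZipCode, and the range of plausible birth years, are all assumptions to adjust:

    import csv

    with open("dogs.csv", newline="", encoding="utf-8") as f:
        # start=2 because the header occupies row 1 of the file
        for i, row in enumerate(csv.DictReader(f), start=2):
            year = row.get("AnimalBirthYear", "")
            # the year range here is an arbitrary guess at plausibility
            if not (year.isdigit() and 1985 <= int(year) <= 2024):
                print(f"row {i}: suspicious birth year {year!r}")
            zipcode = row.get("ZipCode", "")
            if not (zipcode.isdigit() and len(zipcode) == 5):
                print(f"row {i}: malformed zip code {zipcode!r}")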
Checks like these are the tip of an iceberg of possibilities. There are plenty of other weirdnesses that should be understood before drawing any serious conclusions from the data. The dog names are not sorted, and in fact it's not clear what the sort order is.
What is the most likely sort order? Can you validate that or identify exceptions to the sort order?
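One way to test a hypothesis is to check whether a chosen column never decreases from row to row, and to print the exceptions. In this sketch the column name LicenseIssuedDate is just one guess to try; note that comparing dates as strings is only meaningful if they are in a sortable form like YYYY-MM-DD:

    import csv

    with open("dogs.csv", newline="", encoding="utf-8") as f:
        values = [row["LicenseIssuedDate"] for row in csv.DictReader(f)]

    # positions where the value drops relative to the previous row
    breaks = [i for i in range(1, len(values)) if values[i] < values[i - 1]]
    print(len(breaks), "out-of-order adjacent pairs")
    for i in breaks[:10]:   # show the first few exceptions
        print(i, values[i - 1], "->", values[i])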
There are also many duplicates that arise from annual registration; it looks like one could identify the same dog by the combination of name, gender, breed, birth year and a sequence of registrations in chronological order, though this is starting to get a bit complicated.
Invent an algorithm that will identify and compress records that appear to be about the same dog. How many unique dogs are there?
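As a starting point, here is a sketch of one plausible (and debatable) rule: two records describe the same dog if name, gender, breed, and birth year all match. Both the rule and the column names are assumptions; refining them is the point of the exercise:

    import csv

    with open("dogs.csv", newline="", encoding="utf-8") as f:
        # collapse records into a set of (name, gender, breed, year) keys
        dogs = {(row["AnimalName"], row["AnimalGender"],
                 row["BreedName"], row["AnimalBirthYear"])
                for row in csv.DictReader(f)}
    print(len(dogs), "apparently unique dogs")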
There are a handful of records that are absolutely identical.
Find and print duplicated records.
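The classic Unix answer is sort dogs.csv | uniq -d. The same idea in Python, again assuming the file is called dogs.csv:

    from collections import Counter

    # count whole lines, then print any that occur more than once
    with open("dogs.csv", encoding="utf-8") as f:
        counts = Counter(f)

    for line, n in counts.items():
        if n > 1:
            print(n, line.rstrip("\n"))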
Flaky and/or mysterious data is typical of most real-world data sets. It's valuable to have a skeptical mindset and a collection of tools for checking and verifying before drawing too many conclusions.
Come to class prepared to talk about what you did and what more might be done.