Warmup Exercise

Updated Fri Jan 19 08:46:47 EST 2024

Introduction

This warmup exercise is meant to help you to learn / relearn your tools. You can do this with basic Unix commandline programs, or with Python and some libraries, or anything else that appeals, in any combinations that you like. You can even do parts by hand; it might be faster than trying to write a program, though at some point manual labor doesn't scale.

Every aspect of this is representative of issues that you are likely to face in working with your own data. For instance, doing things manually instead of writing code is a tradeoff, based on imperfect information. So is deciding whether to learn a new tool to make experiments easier, or to write your own, or to just make do with ones you already know even though they aren't really right for the job.

A sample data file: shakespeare.txt

The file shakespeare.txt, courtesy of Gutenberg.org, contains all of the works of William Shakespeare (except for Edward III, whose authorship is uncertain). I was surprised after downloading the file to see an accurate number for how much Shakespeare wrote. Make your own guess now about how many words he wrote in his lifetime, then download the file and see how good your estimate was.

Now discard everything but the plays. You could do this manually, or by finding where the irrelevant parts start and end and then writing a simple program to remove what you don't want. Use your favorite text editor to look at the file, and tools like grep to find things.

After you remove the poetry, how much is left, as a fraction of the original?

Exercises

Do the following exercises using whatever combination of tools and techniques you like. As you go, make notes of things that worked well, that failed miserably, that were surprising, and so on. No need to submit anything, but be prepared to talk about it extemporaneously in class.

One early choice is whether to split the original file into separate files, one per play, with systematic names so you can use shell wildcards like *.txt to go through all the plays while still being able to easily see what applies to individual ones. I had originally thought that separate files would be better, but in the end just left things in one file; there was enough structure that I could separate the plays when necessary. Your mileage may vary.

Exercise 0

What character set does this file use? This can matter, for example if your tools assume ASCII and the input is not.

Create a list of the unique characters in the input file

Exercise 1

Summary statistics like raw counts of lines and words tell you something (like how much Shakespeare wrote):

Approximately how many lines, words and characters did Shakespeare write in all?
How many of those are blank lines (or alternatively, how many are non-blank)?

Exercise 2

How many plays did he write? You can count them by hand with the table of contents at the beginning of the original file, but do it with a program since you'll need one later on anyway. What text reliably marks the beginning of a play; alternatively, what text appears only once for each play? How can you use that to identify relevant parts of the text?

How many plays are there in this file?
Print the titles

I did this with 4-6 lines of Awk; Python would be a few lines longer.

Exercise 3

What's in each play? Write a program that, for each play, will count acts and scenes.

compute the number of acts
compute the total number of scenes
(ignore miscellany like prologs and epilogs)

It's marginally more work to compute scenes in each act, but a good exercise.

Exercise 4

How many roles are there in each play?

compute the approximate number of dramatis personae in each play

Exercise 5

Count the words (very approximately is good enough) in each play.

Count the words

Exercise 6

Who does the talking? What are the top ten most frequent speaker names?

compute the names of all speakers in all plays
and how many times each name appears

Hint: this is a one-liner if you know basic Unix tools!

Save the results

Put your data into an Excel or Google spreadsheet. Your sheet should have four columns:

title
number of acts
number of scenes
number of personae

Use CSV or TSV format for the columns, so it's easy to separate the titles from the counts.

My code for all of this (including titles) is one program with 20 lines of straightforward Awk. It's about 10-15 lines longer in Python since you have to write an input loop, and while loops are not as compact. The outputs are identical except for one word count.

Graph them

Plot some simple graphs of your results. For example, here are the plays sorted in order of approximate word count with Excel:

and with matplotlib:

You might use a scatterplot to show the relationship between the number of lines and the number of characters (if there is one).

What else?

You might set up a Git repository for your work, as practice.